Lucene - Core
LUCENE-6664

Replace SynonymFilter with SynonymGraphFilter

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.4, 7.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      Spinoff from LUCENE-6582.

      I created a new SynonymGraphFilter (to replace the current buggy
      SynonymFilter) that produces correct graphs (it does no "graph
      flattening" itself). I think this makes it simpler.

      This means you must add the FlattenGraphFilter yourself, if you are
      applying synonyms during indexing.
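
      For example, a minimal index-time chain might look like this (a
      sketch, assuming the filters from this patch; synonyms is an
      already-built SynonymMap):

        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.Tokenizer;
        import org.apache.lucene.analysis.core.WhitespaceTokenizer;
        import org.apache.lucene.analysis.synonym.FlattenGraphFilter;
        import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
        import org.apache.lucene.analysis.synonym.SynonymMap;

        Analyzer indexAnalyzer = new Analyzer() {
          @Override
          protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer tokenizer = new WhitespaceTokenizer();
            // SynonymGraphFilter produces a correct graph ...
            TokenStream stream = new SynonymGraphFilter(tokenizer, synonyms, true);
            // ... which must be flattened before indexing, since the index
            // cannot represent posLength:
            stream = new FlattenGraphFilter(stream);
            return new TokenStreamComponents(tokenizer, stream);
          }
        };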

      Index-time syn expansion is necessarily a "lossy" graph transformation
      when multi-token (input or output) synonyms are applied, because the
      index does not store posLength, so there will always be phrase
      queries that should match but do not, and phrase queries that
      should not match but do.
      http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
      goes into detail about this.

      However, with this new SynonymGraphFilter, if instead you do synonym
      expansion at query time (and don't do the flattening), and you use
      TermAutomatonQuery (future: somehow integrated into a query parser),
      or maybe just "enumerate all paths and make union of PhraseQuery", you
      should get 100% correct matches (not sure about "proper" scoring
      though...).

      This new syn filter still cannot consume an arbitrary graph.

      Attachments

      1. LUCENE-6664.patch
        153 kB
        Michael McCandless
      2. LUCENE-6664.patch
        156 kB
        Michael McCandless
      3. LUCENE-6664.patch
        169 kB
        Michael McCandless
      4. LUCENE-6664.patch
        154 kB
        Michael McCandless
      5. LUCENE-6664.patch
        71 kB
        Michael McCandless
      6. usa_flat.png
        32 kB
        Michael McCandless
      7. usa.png
        36 kB
        Michael McCandless

        Issue Links

          Activity

          mikemccand Michael McCandless added a comment -

          Patch, still work in progress. It includes the FlattenGraphFilter
          from LUCENE-6638.

          I put everything in sandbox for now, so I could add a test case that
          TermAutomatonQuery works correctly for query-time syn expansion. But
          this added a dep from sandbox on analyzers ... I think I'll move the
          new filters back to analyzers module and comment on the TAQ test case
          as an example.

          mikemccand Michael McCandless added a comment -

          Example syn graph, and flattened version.

          mikemccand Michael McCandless added a comment -

          New patch, fixing all nocommits, folding in all the nice test cases from LUCENE-6582 (thanks Ian Ribas!), fixing some offsets bugs.

          I think it's finally ready. This issue absorbs LUCENE-6638.

          I also wrote a fun test method (toDot(TokenStream)) that converts a TokenStream to a dot file which you can then render with graphviz. E.g. here's the un-flattened expansion for various syns of usa:

          and the corresponding flattened version:

          (red arcs are the inserted synonym tokens)

           With SynonymGraphFilter, multi-token synonyms can finally be correctly represented in the token stream, and using query-time synonyms with either TermAutomatonQuery or some other approach (e.g. expanding all paths and making an OR of PhraseQuery), the correct results should be returned. Index-time synonyms will still be incorrect (fail to match some phrase queries, and incorrectly match other phrase queries) since we don't index the PosLenAttribute.
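
           (The toDot method itself isn't shown here; a rough sketch of the
           same idea, assuming the test-framework's TokenStreamToDot helper
           and an analyzer that applies SynonymGraphFilter, would be:)

             import java.io.FileWriter;
             import java.io.PrintWriter;
             import org.apache.lucene.analysis.TokenStream;
             import org.apache.lucene.analysis.TokenStreamToDot;

             String text = "usa is big";
             try (TokenStream ts = analyzer.tokenStream("body", text);
                  PrintWriter pw = new PrintWriter(new FileWriter("usa.dot"))) {
               // Walks the stream and emits one dot node per position:
               new TokenStreamToDot(text, ts, pw).toDot();
             }
             // then render with graphviz: dot -Tpng usa.dot -o usa.png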

          rcmuir Robert Muir added a comment -

          So is positionlength still (ab)used as really a node ID? Should it use another attribute instead?

          mikemccand Michael McCandless added a comment -

           Rob had a good idea here, to make both SynonymGraphFilter and FlattenGraphFilter package private, and only expose SynonymFilter, which under the hood appends both of these filters.

           This is safer (you can't forget to append the FlattenGraphFilter), but it means users can't play with query-time synonyms until we open this up again. But I think it's a good baby step ... I'll explore it.

          mikemccand Michael McCandless added a comment -

          So is positionlength still (ab)used as really a node ID?

          Yeah: it "generalizes" position as a nodeID, like Kuromoji.

          Should it use another attribute instead?

          Maybe we could do that? FromNodeAtt/ToNodeAtt? And fix SausageFilter to only look at those? Since we'll make both of these package private it would be completely safe, at least until we open them up for query-time synonyms...

          mikemccand Michael McCandless added a comment -

          For query-time integration (a separate future issue!) I think we could use the existing TokenStreamToTermAutomatonQuery, either as-is (producing a TermAutomatonQuery by iterating all tokens from a TokenStream), or forked to instead produce union of PhraseQuery, or MultiPhraseQuery if the graph "matches".
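
           For instance, something along these lines (a sketch; queryAnalyzer
           is assumed to apply SynonymGraphFilter without flattening, and
           searcher is an assumed IndexSearcher):

             import org.apache.lucene.analysis.TokenStream;
             import org.apache.lucene.search.TermAutomatonQuery;
             import org.apache.lucene.search.TokenStreamToTermAutomatonQuery;
             import org.apache.lucene.search.TopDocs;

             try (TokenStream ts = queryAnalyzer.tokenStream("body", "usa is big")) {
               // The automaton's paths are exactly the paths of the token graph:
               TermAutomatonQuery q = new TokenStreamToTermAutomatonQuery().toQuery("body", ts);
               TopDocs hits = searcher.search(q, 10);
             }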

          rcmuir Robert Muir added a comment -

          Yeah: it "generalizes" position as a nodeID, like Kuromoji.

           Wait, I'm confused: positions from Kuromoji always seem normal to me. I think the difference is that its decompounding is a subset of the things that SynonymFilter can do?

          mikemccand Michael McCandless added a comment -

           I think the difference is that its decompounding is a subset of the things that SynonymFilter can do?

          Hmm you're right: when Kuromoji makes a "side path", it's always just a single token.

          I.e., all it really "plays with" is posLenAtt, I think. So it also still only makes sausage graphs...

          mikemccand Michael McCandless added a comment -

          New patch with Rob's idea: I made the new SynonymGraphFilter and
          SausageFilter package private, and replaced the old SynonymFilter with
          these two filters.

          But TestSynonymMapFilter (the existing unit test) fails, because there
          are some changes in behavior with the new filter:

          • Syn output order is different: with the new syn filter, the syn
            comes out before the original token. This is necessary to ensure
            offsets never go backwards...
          • When there are more output tokens for a syn than input tokens,
            then new syn filter makes new positions for the extra tokens, but
            the old one didn't.
          • The new syn filter does more captureState() calls

          I think we need to keep the old behavior available, maybe using a
          Version constant or a separate class (SynFilterPre53,
          LegacySynFilter) or something?

          mikemccand Michael McCandless added a comment -

          Thinking about this more, I think it's sort of silly to only expose the "always flattened synonym filter" because that's really no better than today: you still can't search multi-token synonyms correctly, and there are small differences in how the graph corruption happens.

          So I'd like to go back to the 2nd patch, where both filters are public. I'll mark them experimental...

          mikemccand Michael McCandless added a comment -

          New patch, making the new filters public and experimental again.

          I also improved the naming.

           Robert Muir, is this OK? Or do you think the question of which attributes to use should block committing this? I could also put this in the sandbox?

          rcmuir Robert Muir added a comment -

           I don't agree with changing the meaning of these posInc/posLen atts in this way. So I don't understand the point of committing it this way.

           This isn't a contained commit; it tries to alter their semantics, where sometimes it's now this crazy nodeID. Positions are now not understandable in our analysis API. Is this what we want?

          mikemccand Michael McCandless added a comment -

           I don't agree with changing the meaning of these posInc/posLen atts in this way.

          Actually, I think this is a natural generalization of the existing posInc/posLen atts, and we should let graph filters use these attributes.

          I think making separate "graph attributes" is a short term hack that will just cause future problems.

          Won't it make interop b/w "graph" and "non-graph" filters a nightmare? E.g. we'd need separate StopGraphFilter and a StopFilter, LengthGraphFilter/LengthFilter, etc.?

           So I don't understand the point of committing it this way.

          OK I won't commit this.

           Really, until we have a complete query-time solution here, I suppose this patch is essentially pointless.

          But I was hoping by making it available, other devs would see it and try to innovate on the query-time side of things.

          Are you also -1 on committing this to sandbox?

          rcmuir Robert Muir added a comment -

           OK, if you say it's a generalization, then I am OK. But are you saying current code in query parsers won't do the wrong thing? I know it's a tough one, since it already is somewhat wrong today! I'm just asking because we don't want to make it worse or more confusing.

          mikemccand Michael McCandless added a comment -

           I pushed this out to 5.4; we shouldn't shove it into 5.3.

          Thanks Robert Muir, I'll try to update the javadocs for posInc/posLenAtt.

          I think the current code in queryparsers simply ignores posLen, right? (just like indexer) Which should "just" mean positions get messed up in the same way as indexer, so some phrase queries will work but others won't...

          rcmuir Robert Muir added a comment -

           But they use posInc! So I think it's important not to break the semantics. I'm super-against changing this to "nodeID" because it will break code like that. We need a new attribute for that.

          mikemccand Michael McCandless added a comment -

          I'm super-against changing this to "nodeID" because it will break code like that. We need a new attribute for that.

          OK I won't commit this.

           I'll open a new issue to figure out whether/how we should implement graph filters...

          Really, unless we can figure out how query parsers can make use of graph filters like this one, there's not much point in committing this.

          mikemccand Michael McCandless added a comment -

          We can't move forward here until we figure out how to keep back-compat ...

          mdavis95 Matt Davis added a comment -

          Could this be a candidate for 6.0 (breaking back-compat)?

          mikemccand Michael McCandless added a comment -

          Could this be a candidate for 6.0 (breaking back-compat)?

          Probably not ... I think Robert Muir's objections would still hold for 6.0 as well?

          The patch is fairly standalone so you could just use this new token filter (SynonymGraphFilter) in your own world?

          paul.elschot@xs4all.nl Paul Elschot added a comment -

          From the SausageGraphFilter: Lucene cannot yet index an arbitrary token graph.

           Perhaps positional joins (LUCENE-5627) can help here. This indexes joins between non-decreasing positions of any field in the same document, and allows the joins to be queried. However, I have the impression that these positional joins bring more complexity than what is needed here.

           One basic mechanism for the positional joins is a non-decreasing series of positions. (Currently these are in payloads; I'm considering docvalues.) These are accessed by both index and value, and used at query time to jump, for example, from one field to another.
           Another basic mechanism there is a hierarchy between the positions of a single field, for example for nested XML element names. This hierarchy is probably too restrictive here.

          How arbitrary are the token graphs here?

          mikemccand Michael McCandless added a comment -

          How arbitrary are the token graphs here?

           Well, the token graphs produced by SynonymGraphFilter (this patch) don't really have any more expressibility than what existing tokenizers can do (e.g. Kuromoji/JapaneseTokenizer). They are acyclic, and are enumerated under similar constraints as the automaton API, i.e. you must add all tokens leaving a given position before moving to the next position.

          I think what must be controversial about this patch is that this new filter can create new positions, which is necessary to fix bugs in the old SynonymFilter to correctly handle syns that expand to more tokens than they consumed, e.g. "dns -> domain name system". Because otherwise you cannot distinguish the output of SynonymGraphFilter from e.g. JapaneseTokenizer: they both produce graphs with side paths.

           Probably LUCENE-5012 is the only realistic way to move forward here: the synonym filter on that branch fixes the bugs that SynonymGraphFilter (this patch) also fixes, and then fixes additional bugs so that it can consume an incoming graph as well, e.g. on that branch you could apply synonyms to the graph output from JapaneseTokenizer.
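
           (For reference, a multi-token rule like that is registered through
           the existing SynonymMap.Builder API; a minimal sketch:)

             import org.apache.lucene.analysis.synonym.SynonymMap;
             import org.apache.lucene.util.CharsRef;
             import org.apache.lucene.util.CharsRefBuilder;

             SynonymMap.Builder builder = new SynonymMap.Builder(true); // dedup
             CharsRefBuilder scratch = new CharsRefBuilder();
             // "dns" expands to the three-token output "domain name system";
             // includeOrig=true keeps the original token as a parallel path:
             builder.add(new CharsRef("dns"),
                 SynonymMap.Builder.join(new String[] {"domain", "name", "system"}, scratch),
                 true);
             SynonymMap synonyms = builder.build();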

          rcmuir Robert Muir added a comment -

          I thought I already explained my concerns well. I guess I will try yet again...

           • I think it's a huge, huge break to modify the semantics of existing token attributes like position increment, to mean something very different (nodeID); that seems like an awkward fit, wedged in.
           • If position increment/length no longer have meaning and are just confusing ways of storing node ID, that's a huge usability issue: let's use a clean slate of attributes instead of abusing them in that way...
           • I'm concerned about complexity: today the simple case (posInc + posLen) is hard enough to understand, but doable. I feel like going this route pushes the complexity onto the simple case, and that worries me a lot. A lot of people just simply will not have graphs! Should all of our analysis components really be required to support them? Maybe we should really be putting the effort into the ability to use alternative analysis APIs, so that we can have a "graph analysis" API that just works totally differently from the tokenstream one, and so on.
          jkrupan Jack Krupansky added a comment -

          Hey Michael McCandless, don't get discouraged, this was a very valuable exercise. I am a solid proponent of getting multi-term synonyms working in a full and robust manner, but I recognize that they just don't fit in cleanly with the existing flat token stream architecture. That's life. In any case, don't give up on this long-term effort.

           Maybe the best thing for now is to retain the traditional flat synonym filter for compatibility, fully add the new SynonymGraphFilter, and then add the optional ability to enable graph support in the main Lucene query parser. (Alas, Solr has its own fork of the Lucene query parser.) Support within phrase queries is the tricky part.

           It would also be good to address the issue with non-phrase terms being analyzed separately - the query parser should recognize adjacent terms without operators and analyze them as a group so that multi-token synonyms can be recognized.

          mikemccand Michael McCandless added a comment -

          don't get discouraged,

          I'm not discouraged.

          Many issues have been blocked before because they are controversial. I resolved this as "later" because that's the reality of what's happening: it will be a long time until we can fix these bugs in SynonymFilter.

          This is just how Apache open source works: it's inherently a conservative / design by committee development model, and one veto to a change blocks it. Only changes everyone agrees on are allowed. The more successful the project, the more conservative its development becomes.

          The few users who are affected by the buggy SynonymFilter we have today can always test SynonymGraphFilter in this patch with query-time synonyms to confirm it fixes their phrase query bugs (please report back!).

          But users are definitely affected by these bugs today, e.g. see https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter where there's lots of exploration on how to work around the bugs that this patch in fact fixes correctly, if you are willing/able to use query-time synonyms.

          My original blog post on this topic also explains the bugs: http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html

          I recognize that they just don't fit in cleanly with the existing flat token stream architecture.

          I disagree.

          This is exactly why we added PosLengthAttribute originally, and e.g. Kuromoji makes use of that very well: it produces a graph token stream. I think there is an overblown/irrational fear of graph tokenizers at work here... maybe we should remove all graph tokenizers/token filters, along with PosLengthAttribute? To arbitrarily declare that only tokenizers (not token filters) can create new positions makes no sense to me: the resulting output from the tokenizer or the token filter is indistinguishable.

          Furthermore, the graph flattening filter in this patch gives good index-time back compat if you apply your synonyms during indexing, while also enabling bug-free query time multi-token synonyms.

          enable graph support in the main Lucene query parser.

          Right, this is a missing part now. We do have TermAutomatonQuery, to execute the full token graph correctly, but we still have to fix the query parser to somehow produce that query when it "sees" a graph when tokenizing a phrase query? Maybe that's not so hard, e.g. we could always create a TermAutomatonQuery but fix that query to rewrite to a simple PhraseQuery or MultiPhraseQuery if it was an "easy" case?

          (Alas, Solr, has its own fork of the Lucene query parser.)

          Hmm, why? There are so many query parsers now ...

          It would also be good to address the issue with non-phrase terms being analyzed separately

          Hmm what does this mean? I thought the query parsers analyze whole text chunks between operators, so they could already apply multi-token synonyms not inside a phrase query?

          mikemccand Michael McCandless added a comment -

          Maybe that's not so hard, e.g. we could always create a TermAutomatonQuery but fix that query to rewrite to a simple PhraseQuery or MultiPhraseQuery if it was an "easy" case?

          I opened LUCENE-6824 for this ... seems like a low hanging fruit to make query parser integration easier.

          steve_rowe Steve Rowe added a comment -

           Michael McCandless, I think your repurposing of posincr/poslen on this issue (as node ids) is to enable non-lossy query parser interpretation of token streams, so that e.g. tokens from overlapping phrases aren't inappropriately interleaved in generated queries, like your wtf example on LUCENE-6582:

          if I have these synonyms:

          wtf --> what the fudge
          wtf --> wow that's funny
          

          And then I'm tokenizing this:

          wtf happened
          

          Before this change (today) I get this crazy sausage incorrectly
          matching phrases like "wtf the fudge" and "wow happened funny":

          But after this change, the expanded synonyms become separate paths in
          the graph right? So it will look like this?:

          An alternative implementation idea I had, which would not change posincr/poslen semantics, is to add a new attribute encoding an entity ID. Graph-aware producers would mark tokens that should be treated as a sequence with the same entity ID, and graph-aware consumers would use the entity ID to losslessly interpret the resulting graph. Here's the wtf example using this scheme:

           token     posInc  posLen  entityID
           wtf       1       3       0
           what      0       1       1
           wow       0       1       2
           the       1       1       1
           that's    0       1       2
           fudge     1       1       1
           funny     0       1       2
           happened  1       1       3

          No flattening stage is required. Non-graph-aware components aren't affected (I think). And handling QueryParser.autoGeneratePhraseQueries() properly (see LUCENE-7533) would be easy: if more than one token has the same entityID, then it should be a phrase when autoGeneratePhraseQueries=true.

          I haven't written any code yet, so I'm not sure this idea is feasible.

          Thoughts?

          mikemccand Michael McCandless added a comment -

          This is a neat approach!

          I like that no explicit flattening stage is required, though I think it means graph-producing stages must do their own flattening, to resolve side-paths into equivalent flattened positions? But maybe we could factor out a little thingy to share code.

          If a side path spawns another side path it would get a new entity ID right?

          And then, a node ID is really just the tuple of (entityID, position), and you can losslessly reconstruct the graph.

          steve_rowe Steve Rowe added a comment -

          I think it means graph-producing stages must do their own flattening, to resolve side-paths into equivalent flattened positions? But maybe we could factor out a little thingy to share code.

          +1

          If a side path spawns another side path it would get a new entity ID right?

          I think so, yes, though that points to a problem: new entity IDs would have to be produced without knowing which ones have already been used - otherwise a filter would have to consume the whole stream before it could assign a new one. I was assuming it would be one-up integers, but that won't work in this case. Hmm.

          jira-bot ASF subversion and git services added a comment -

          Commit a86f807685403537c20aa697b7c7e06bd97cbdf9 in lucene-solr's branch refs/heads/master from Mike McCandless
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a86f807 ]

          LUCENE-6664: add getter

          jira-bot ASF subversion and git services added a comment -

          Commit 3046f86ce7344dda560ecd925eb22fc05b6d5f1f in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3046f86 ]

          LUCENE-6664: add getter

          mikemccand Michael McCandless added a comment -

           I'm re-opening this issue: I think my original patch here is a good way to move forward. It is a simple, backwards-compatible way for token streams to naturally produce graphs, and to empower token filters to create new positions.

           Existing token streams that produce posInc=0 or posInc=1 and posLength=1 tokens naturally work the way they do today with this change, producing "sausage" graphs.

          Graph-aware token streams, like the new SynonymGraphFilter here, the Kuromoji JapaneseTokenizer, and WordDelimiterFilter if we improve it, can produce correct graphs which can be used at query time to make accurate queries.

           Today, multi-word synonyms are buggy (see https://lucidworks.com/blog/2014/07/12/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter and http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html), missing hits that should match and incorrectly returning hits that should not match, for queries that involve the synonyms. With this change, query-time synonym expansion, along with separate improvements to the query parsers, would fix the bug. The required changes to query parsing are surprisingly contained ... see https://github.com/elastic/elasticsearch/pull/21517 as an example approach.

          I am not proposing, here, that the Lucene index format be changed to support indexing a position graph. Instead, I'm proposing that we make it possible for query-time position graphs to work correctly, so multi-token synonyms are no longer buggy, and I think this is a good way to make that happen.

          teofili Tommaso Teofili added a comment -

          I'm proposing that we make it possible for query-time position graphs to work correctly, so multi-token synonyms are no longer buggy, and I think this is a good way to make that happen.

          +1

          dsmiley David Smiley added a comment -

          +1. This should be experimental for now as people begin to try this out.

          I don't think this patch abuses the semantics of posInc or posLen. It's a shame most (all?) filters don't handle input posLen != 1 properly but that's a separate issue.

          mikemccand Michael McCandless added a comment -

          Here's another patch, just modernizing the last one to apply to
          current master, renaming SausageGraphFilter to
          FlattenGraphFilter and fixing a few javadocs. I think it's
          ready.

          jira-bot ASF subversion and git services added a comment -

          Commit c0467bb929133605fca2bc63fe1ebba758332d41 in lucene-solr's branch refs/heads/master from Mike McCandless
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c0467bb ]

          LUCENE-6664: add SynonymGraphFilter for correct multi-token synonym handling

          jira-bot ASF subversion and git services added a comment -

          Commit 68db0334089b3cb052d660d143d06aaedd7b922c in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=68db033 ]

          LUCENE-6664: add SynonymGraphFilter for correct multi-token synonym handling

          steve_rowe Steve Rowe added a comment -

          Looks like the new FlattenGraphFilter is implicated in this reproducing TestRandomChains failure from Policeman Jenkins https://builds.apache.org/job/Lucene-Solr-NightlyTests-master/1193/:

             [junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains
             [junit4]   2> TEST FAIL: useCharFilter=true text='\ud991\udc33\u0662 vb wlvvo \ufe0f\ufe04\ufe01\ufe05\ufe00\ufe07 ]u[{1,5 ntwwqlyvt \ua4ed\ua4d2\ua4ff\ua4fd\ua4ef\ua4db\ua4e3\ua4e4\ua4db\ua4e2\ua4ea jlrzerz'
             [junit4]   2> Exception from random analyzer: 
             [junit4]   2> charfilters=
             [junit4]   2> tokenizer=
             [junit4]   2>   org.apache.lucene.analysis.wikipedia.WikipediaTokenizer()
             [junit4]   2> filters=
             [junit4]   2>   org.apache.lucene.analysis.commongrams.CommonGramsFilter(ValidatingTokenFilter@68052231 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0, [wfo, i, ecngk, lntfhzycu, f])
             [junit4]   2>   org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter(ValidatingTokenFilter@4c507013 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,keyword=false)
             [junit4]   2>   org.apache.lucene.analysis.synonym.FlattenGraphFilter(ValidatingTokenFilter@27c78e22 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,keyword=false)
             [junit4]   2> offsetsAreCorrect=false
             [junit4]   2> NOTE: download the large Jenkins line-docs file by running 'ant get-jenkins-line-docs' in the lucene directory.
             [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestRandomChains -Dtests.method=testRandomChainsWithLargeStrings -Dtests.seed=740BB1C4895371B0 -Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true -Dtests.linedocsfile=/home/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-master/test-data/enwiki.random.lines.txt -Dtests.locale=es-CO -Dtests.timezone=EET -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
             [junit4] ERROR   14.8s J0 | TestRandomChains.testRandomChainsWithLargeStrings <<<
             [junit4]    > Throwable #1: java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, startOffset=24,endOffset=22
             [junit4]    > 	at __randomizedtesting.SeedInfo.seed([740BB1C4895371B0:1E500ED5D01D5143]:0)
             [junit4]    > 	at org.apache.lucene.analysis.tokenattributes.PackedTokenAttributeImpl.setOffset(PackedTokenAttributeImpl.java:107)
             [junit4]    > 	at org.apache.lucene.analysis.synonym.FlattenGraphFilter.releaseBufferedToken(FlattenGraphFilter.java:237)
             [junit4]    > 	at org.apache.lucene.analysis.synonym.FlattenGraphFilter.incrementToken(FlattenGraphFilter.java:264)
             [junit4]    > 	at org.apache.lucene.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:67)
             [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:724)
             [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:635)
             [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:533)
             [junit4]    > 	at org.apache.lucene.analysis.core.TestRandomChains.testRandomChainsWithLargeStrings(TestRandomChains.java:869)
             [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
             [junit4]   2> NOTE: leaving temporary files on disk at: /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-master/checkout/lucene/build/analysis/common/test/J0/temp/lucene.analysis.core.TestRandomChains_740BB1C4895371B0-001
             [junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): {dummy=PostingsFormat(name=LuceneVarGapFixedInterval)}, docValues:{}, maxPointsInLeafNode=1442, maxMBSortInHeap=6.705070576143851, sim=RandomSimilarity(queryNorm=true): {dummy=DFR I(n)L1}, locale=es-CO, timezone=EET
             [junit4]   2> NOTE: Linux 3.13.0-85-generic amd64/Oracle Corporation 1.8.0_102 (64-bit)/cpus=4,threads=1,free=136682616,total=285736960
             [junit4]   2> NOTE: All tests run in this JVM: [TestRussianLightStemFilter, TestEnglishAnalyzer, EdgeNGramTokenFilterTest, TestSwedishAnalyzer, TestHindiFilters, TestHindiNormalizer, TestHungarianLightStemFilterFactory, TestPorterStemFilter, TestCondition, TestTruncateTokenFilterFactory, TestCollationKeyAnalyzer, TestSpanishAnalyzer, TestHTMLStripCharFilterFactory, TestArabicFilters, TestFactories, TestIrishLowerCaseFilterFactory, TestBrazilianAnalyzer, TestLatvianAnalyzer, TestEscaped, TestPortugueseLightStemFilter, TestElisionFilterFactory, TestHungarianAnalyzer, TestGreekLowerCaseFilterFactory, TestElision, TestCustomAnalyzer, TestTurkishLowerCaseFilterFactory, TestFullStrip, TestSpanishLightStemFilterFactory, TestNorwegianMinimalStemFilterFactory, QueryAutoStopWordAnalyzerTest, NGramTokenizerTest, WikipediaTokenizerTest, TestIndonesianStemFilterFactory, TestBasqueAnalyzer, DateRecognizerFilterFactoryTest, TestNorwegianLightStemFilter, TestFrenchMinimalStemFilterFactory, TestCommonGramsFilterFactory, TestPersianNormalizationFilterFactory, TestTwoSuffixes, TestIndonesianStemmer, TypeAsPayloadTokenFilterTest, TestFrenchLightStemFilterFactory, TestThaiAnalyzer, TestCaseSensitive, TestRandomChains]
             [junit4] Completed [133/275 (1!)] on J0 in 121.18s, 2 tests, 1 error <<< FAILURES!
          
          mikemccand Michael McCandless added a comment -

          Thanks Steve Rowe, I'll look...

          steve_rowe Steve Rowe added a comment -

          Another TestRandomChains failure, from https://builds.apache.org/job/Lucene-Solr-NightlyTests-6.x/239/:

          Checking out Revision 9dde8a30303de4bce5da45189219dd6150252b29 (refs/remotes/origin/branch_6x)
          [...]
             [junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains
             [junit4]   2> TEST FAIL: useCharFilter=true text='ivi[q.(k--r f\u0002\uf672o\u983c'
             [junit4]   2> Exception from random analyzer: 
             [junit4]   2> charfilters=
             [junit4]   2>   org.apache.lucene.analysis.charfilter.HTMLStripCharFilter(java.io.StringReader@261ff7a0)
             [junit4]   2> tokenizer=
             [junit4]   2>   org.apache.lucene.analysis.wikipedia.WikipediaTokenizer()
             [junit4]   2> filters=
             [junit4]   2>   org.apache.lucene.analysis.StopFilter(ValidatingTokenFilter@62f3af70 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0, [hejalskyy, d, skap, nfd, nirasnsmg, hmdqqn])
             [junit4]   2>   org.apache.lucene.analysis.synonym.FlattenGraphFilter(ValidatingTokenFilter@1a9001e5 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0)
             [junit4]   2> offsetsAreCorrect=false
             [junit4]   2> NOTE: download the large Jenkins line-docs file by running 'ant get-jenkins-line-docs' in the lucene directory.
             [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestRandomChains -Dtests.method=testRandomChains -Dtests.seed=127E19CE02B54D17 -Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true -Dtests.linedocsfile=/home/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-6.x/test-data/enwiki.random.lines.txt -Dtests.locale=zh-HK -Dtests.timezone=America/Virgin -Dtests.asserts=true -Dtests.file.encoding=UTF-8
             [junit4] ERROR   62.4s J1 | TestRandomChains.testRandomChains <<<
             [junit4]    > Throwable #1: java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, startOffset=4,endOffset=3
             [junit4]    > 	at __randomizedtesting.SeedInfo.seed([127E19CE02B54D17:2F9F30AF45A750D7]:0)
             [junit4]    > 	at org.apache.lucene.analysis.tokenattributes.PackedTokenAttributeImpl.setOffset(PackedTokenAttributeImpl.java:107)
             [junit4]    > 	at org.apache.lucene.analysis.synonym.FlattenGraphFilter.releaseBufferedToken(FlattenGraphFilter.java:237)
             [junit4]    > 	at org.apache.lucene.analysis.synonym.FlattenGraphFilter.incrementToken(FlattenGraphFilter.java:264)
             [junit4]    > 	at org.apache.lucene.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:67)
             [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:723)
             [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:634)
             [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:532)
             [junit4]    > 	at org.apache.lucene.analysis.core.TestRandomChains.testRandomChains(TestRandomChains.java:842)
             [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
             [junit4]   2> NOTE: leaving temporary files on disk at: /x1/jenkins/jenkins-slave/workspace/Lucene-Solr-NightlyTests-6.x/checkout/lucene/build/analysis/common/test/J1/temp/lucene.analysis.core.TestRandomChains_127E19CE02B54D17-001
             [junit4]   2> NOTE: test params are: codec=Asserting(Lucene62): {dummy=PostingsFormat(name=Memory doPackFST= true)}, docValues:{}, maxPointsInLeafNode=772, maxMBSortInHeap=6.693205231616328, sim=RandomSimilarity(queryNorm=false,coord=yes): {dummy=DFR I(F)LZ(0.3)}, locale=zh-HK, timezone=America/Virgin
             [junit4]   2> NOTE: Linux 3.13.0-85-generic amd64/Oracle Corporation 1.8.0_102 (64-bit)/cpus=4,threads=1,free=177327976,total=255852544
             [junit4]   2> NOTE: All tests run in this JVM: [TestPatternTokenizerFactory, TestCircumfix, TestReverseStringFilterFactory, TestSnowball, TestIrishAnalyzer, TestBulgarianAnalyzer, TestHomonyms, TestKeywordRepeatFilter, TestPrefixAwareTokenFilter, CommonGramsFilterTest, TestHyphenationCompoundWordTokenFilterFactory, TestSoraniAnalyzer, TestGermanStemFilterFactory, TestEmptyTokenStream, TestIndicNormalizer, TestTurkishLowerCaseFilter, TestGalicianMinimalStemFilterFactory, TestDecimalDigitFilterFactory, TestLatvianStemmer, TestItalianLightStemFilter, TestKeepWordFilter, TestLithuanianStemming, TestKeepFilterFactory, TestPortugueseMinimalStemFilter, TestAnalyzers, TestAlternateCasing, TestSoraniStemFilter, TestApostropheFilterFactory, TestDictionary, TestCodepointCountFilterFactory, TestDanishAnalyzer, TestRomanianAnalyzer, TestPortugueseMinimalStemFilterFactory, TestArabicNormalizationFilter, TestLimitTokenOffsetFilterFactory, TestZeroAffix, DateRecognizerFilterTest, TestGermanLightStemFilter, TestCJKAnalyzer, TestMorphData, TestBulgarianStemFilterFactory, TestSynonymGraphFilter, TestGermanNormalizationFilterFactory, TestBulgarianStemmer, DelimitedPayloadTokenFilterTest, TestStrangeOvergeneration, TestFactories, TestRandomChains]
             [junit4] Completed [141/275 (1!)] on J1 in 119.08s, 2 tests, 1 error <<< FAILURES!
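
          For reference, the exception in both failures comes from the offset
          invariant that PackedTokenAttributeImpl.setOffset enforces for every
          token: startOffset must be non-negative and endOffset must be >=
          startOffset. A minimal standalone sketch (illustration only, not from
          any patch here) that trips the same check:

             import org.apache.lucene.analysis.tokenattributes.PackedTokenAttributeImpl;

             public class OffsetInvariantDemo {
               public static void main(String[] args) {
                 PackedTokenAttributeImpl att = new PackedTokenAttributeImpl();
                 att.setOffset(0, 5);  // legal: 0 <= startOffset <= endOffset
                 att.setOffset(4, 3);  // throws IllegalArgumentException, as in the failure above
               }
             }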
          
          mikemccand Michael McCandless added a comment -

          Thanks Steve Rowe, I'll have a look, but likely not until I'm back from vacation next year ...

          jira-bot ASF subversion and git services added a comment -

          Commit f6fb6941bb62f8d47d653b2ed187ffa0107cd5c5 in lucene-solr's branch refs/heads/master from Mike McCandless
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f6fb694 ]

          LUCENE-6664: be more robust to broken token stream offsets

          jira-bot ASF subversion and git services added a comment -

          Commit c35fbbd328687f5e309fcb00acf6169122f2a009 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c35fbbd ]

          LUCENE-6664: be more robust to broken token stream offsets

          mikemccand Michael McCandless added a comment -

          Steve Rowe, I pushed a fix for one of the above failures. They happen when broken offsets are sent into FlattenGraphFilter.
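
          To make "be more robust to broken token stream offsets" concrete: the
          sketch below (hypothetical helper and names, not the actual commit)
          shows the kind of defensive clamp a filter like FlattenGraphFilter
          needs before calling OffsetAttribute.setOffset, so that broken
          upstream offsets are repaired instead of tripping the
          IllegalArgumentException seen in the failures above:

             import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

             // Sketch only -- not the actual LUCENE-6664 change.
             static void setOffsetDefensively(OffsetAttribute offsetAtt, int startOffset, int endOffset) {
               if (startOffset < 0) {
                 startOffset = 0;          // repair a negative start
               }
               if (endOffset < startOffset) {
                 endOffset = startOffset;  // repair an inverted span (e.g. startOffset=4, endOffset=3)
               }
               offsetAtt.setOffset(startOffset, endOffset);  // now always satisfies the invariant
             }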


            People

            • Assignee:
              mikemccand Michael McCandless
            • Reporter:
              mikemccand Michael McCandless
            • Votes: 4
            • Watchers: 22
