Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.4
    • Fix Version/s: None
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      As part of converting tests for SOLR-1657, I ran into a problem with SynonymFilter.

      The test for 'repeats' has a flaw: it uses the assertTokEqual construct, which does not really validate that two lists of tokens are equal; it just stops at the shorter one.

          // repeats
          map.add(strings("a b"), tokens("ab"), orig, merge);
          map.add(strings("a b"), tokens("ab"), orig, merge);
          assertTokEqual(getTokList(map,"a b",false), tokens("ab"));
          /* in reality the result from getTokList is ab ab ab!!!!! */
      

      When converted to assertTokenStreamContents, this problem surfaced. Attached is an additional assertion for the existing test case.
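      To illustrate the flaw (a hypothetical simplification, not the actual Solr test code): a comparison that stops at the shorter list silently accepts extra trailing tokens, while a strict comparison in the style of assertTokenStreamContents rejects them.

```java
import java.util.List;

public class TokenAssertSketch {
    // Flawed comparison in the spirit of assertTokEqual (hypothetical
    // simplification): walks both lists but stops at the shorter one,
    // so trailing extra tokens in `actual` go unnoticed.
    static boolean laxEquals(List<String> actual, List<String> expected) {
        for (int i = 0; i < Math.min(actual.size(), expected.size()); i++) {
            if (!actual.get(i).equals(expected.get(i))) return false;
        }
        return true;
    }

    // Strict comparison in the spirit of assertTokenStreamContents:
    // the lengths must match as well as each token.
    static boolean strictEquals(List<String> actual, List<String> expected) {
        return actual.equals(expected);
    }

    public static void main(String[] args) {
        List<String> actual = List.of("ab", "ab", "ab"); // real filter output
        List<String> expected = List.of("ab");           // what the test asserted
        System.out.println(laxEquals(actual, expected));    // true  (bug hidden)
        System.out.println(strictEquals(actual, expected)); // false (bug surfaced)
    }
}
```

      Here laxEquals accepts ("ab", "ab", "ab") against an expected ("ab"), which is exactly how the repeats bug stayed hidden.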

      1. SOLR-1670_test.patch
        0.6 kB
        Robert Muir
      2. SOLR-1670.patch
        4 kB
        Yonik Seeley
      3. SOLR-1670.patch
        13 kB
        Steve Rowe
      4. SOLR-1670.patch
        13 kB
        Steve Rowe

        Issue Links

          Activity

          Robert Muir added a comment -

          By the way, I don't know how to fix this repeat problem (which is why I only uploaded a test, sorry). I am going to get some sleep and look at it tomorrow; it's the last tokenstream to go through, I promise.

          If anyone knows this code better, please help.

          Robert Muir added a comment -

          Note: I have found more bugs in this filter by reworking the tests.

          It is impossible to fix the real bugs in the synonyms filter the way the tests are written now, as they do not compare correctly.
          I will upload the patch to SOLR-1657 to fix all the analysis tests first, so that we have tests that work and can actually start debugging this.
          In the comments I will put what the old test "thought" it was testing as a FIXME.

          Koji Sekiguchi added a comment -

          the test for 'repeats' has a flaw: it uses the assertTokEqual construct, which does not really validate that two lists of tokens are equal; it just stops at the shorter one.

          I agree with you regarding this part. But I'm not sure that the following size() should be 1 in your patch:

          +    assertEquals(1, getTokList(map,"a b",false).size());
          

          If what "repeats" implies is intentionally repeating the same term, I think it can boost tf.

          Robert Muir added a comment -

          Hi Koji, I don't really understand what you are saying here.

          If it is a feature of synonymfilter that it should repeat to boost TF, then the test should have looked like this:

          assertTokEqual(getTokList(map,"a b",false), tokens("ab ab ab"));

          but it does not, the test looks like this:
          assertTokEqual(getTokList(map,"a b",false), tokens("ab"));

          Robert Muir added a comment -

          Koji, in my opinion the first step towards fixing this would be to resolve SOLR-1674.

          This fixes the tests so that they reveal the true current behavior.
          Because of the many problems with the current tests, I do not understand what the desired behavior of this filter should even be.
          The only thing I know for sure is what the current behavior is.

          Here is another example:

              // test that generated tokens start at the same offset as the original
              map.add(strings("a"), tokens("aa"), orig, merge);
              assertTokEqual(getTokList(map,"a,5",false), tokens("aa,5")); /* NOTE: This position increment is really 1, but the test passes!!!! */
          

          All the tests similar to the above test that position increments are the same as the original token.
          But this is not actually what is happening; in fact, SynonymFilter is generating 'aa' with a position increment of 1.

          There are several examples like this in SOLR-1674. What I did was write the assertions so that the tests pass, revealing the true behavior.
          But I put comments above these assertions that say what I thought the existing test was trying to show, i.e. the desired behavior.
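          For readers unfamiliar with position increments: a token's absolute position is the running sum of posInc values, so posInc 0 stacks a token at the previous token's position and posInc 1 advances by one. A minimal sketch (hypothetical helper, not Solr code):

```java
import java.util.ArrayList;
import java.util.List;

public class PositionSketch {
    // A token's absolute position is the running sum of position
    // increments; posInc 0 stacks a token on top of the previous one,
    // posInc 1 advances to the next position.
    static List<Integer> absolutePositions(List<Integer> increments) {
        List<Integer> positions = new ArrayList<>();
        int pos = -1; // before the first token
        for (int inc : increments) {
            pos += inc;
            positions.add(pos);
        }
        return positions;
    }

    public static void main(String[] args) {
        // The old test asserted "aa,5" (posInc 5), but the filter
        // actually emitted "aa" with posInc 1 - a different position.
        System.out.println(absolutePositions(List.of(5))); // [4]
        System.out.println(absolutePositions(List.of(1))); // [0]
    }
}
```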

          Koji Sekiguchi added a comment -

          Robert, sorry, I wanted to say I agree with you regarding "the test for 'repeats' has a flaw". The "boost TF" remark was just an input, though I don't know whether it is an intentional feature or a side effect.

          Why don't you fix the flaws in SynonymFilter test in this ticket first, then fix SOLR-1674? (I've not looked into SOLR-1674 yet.)

          Robert Muir added a comment -

          Why don't you fix the flaws in SynonymFilter test in this ticket first, then fix SOLR-1674? (I've not looked into SOLR-1674 yet.)

          Because how would you know I properly fixed it? In fact, if anyone else submitted a patch to fix this issue, I don't really care how many tests they write, I would post a comment voting against the patch, because the tests mean nothing.

          The analysis tests themselves (things like this assertTokEqual) are broken, so SOLR-1674 fixes the analysis tests so they actually work.

          I think we have to make the tests work correctly before we can think about fixing nontrivial bugs.

          Yonik Seeley added a comment -

          This is my mess that predates Solr even going open source... I may have a few neurons that can still recognize what it was trying to do.

          The repeats test was testing that repeated rules didn't mess anything up. In the code you quote, the rule a b -> ab is added twice. The rule merging code should handle it (the merging code handles merging all common prefixes).

          Robert, can you tell if the flaw is in the test code or the SynonymFilter? I'm pretty sure that such a bug (in the filter) wasn't originally there... so my money would be on the test - perhaps getTokList, but I haven't had a chance to look yet.
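          The prefix merging Yonik describes can be pictured as a trie keyed by input tokens. A deliberately naive sketch (hypothetical, not the actual SynonymMap code) shows how adding the identical rule twice can share the trie node yet still accumulate duplicate outputs - the same shape of bug as the "ab ab ab" output in this issue:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergeSketch {
    // Hypothetical sketch of prefix-merging rule storage: each rule's
    // input tokens walk a trie, so rules sharing a prefix share nodes.
    static class Node {
        Map<String, Node> children = new HashMap<>();
        List<String> synonyms = new ArrayList<>();
    }

    static void add(Node root, List<String> input, List<String> output) {
        Node cur = root;
        for (String tok : input) {
            cur = cur.children.computeIfAbsent(tok, k -> new Node());
        }
        cur.synonyms.addAll(output); // naive merge: duplicates accumulate here
    }

    public static void main(String[] args) {
        Node root = new Node();
        add(root, List.of("a", "b"), List.of("ab"));
        add(root, List.of("a", "b"), List.of("ab"));
        // The trie node is shared, but the naive merge duplicated the output.
        Node leaf = root.children.get("a").children.get("b");
        System.out.println(leaf.synonyms); // [ab, ab]
    }
}
```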

          Robert Muir added a comment - edited

          Robert, can you tell if the flaw is in the test code or the SynonymFilter? I'm pretty sure that such a bug (in the filter) wasn't originally there... so my money would be on the test - perhaps getTokList, but I haven't had a chance to look yet.

          I am pretty sure there is a problem in both the existing test and the SynonymMap merging code.
          This is why I think there is some confusion.
          The existing test implies that the output will be "ab", which makes me think the SynonymMap should merge these duplicate definitions.
          The real output is "ab ab ab", but the test as written (expecting "ab") passes with the existing test setup; the problem is assertTokEqual().

          The rewritten test in SOLR-1674 looks like:

          assertTokenizesTo(map, "a b", new String[] { "ab", "ab", "ab"  });
          

          where assertTokenizesTo is just

          static void assertTokenizesTo(SynonymMap dict, String input,
              String[] expected) throws IOException {
            Tokenizer tokenizer = new WhitespaceTokenizer(new StringReader(input));
            SynonymFilter stream = new SynonymFilter(tokenizer, dict);
            assertTokenStreamContents(stream, expected);
          }
          

          and assertTokenStreamContents is the one copied from Lucene.

          Edit: I think I resolved the additional problems.

          Yonik Seeley added a comment -

          Yep, must be a bug in the merging code, never caught because of a bug in assertTokEqual. I'll look into it.

          Yonik Seeley added a comment -

          Here's a patch that removes duplicate tokens at the same position.
          This isn't really a serious bug, as the repeated tokens were at the same position.

          I don't think assertTokEqual really has a bug - it's written more to match Lucene queries and indexes, not to exactly compare one token stream with another. So a single ab token matches multiple ab tokens at the same position.
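          The effect of removing duplicate tokens at the same position can be sketched as follows (a hypothetical simplification, not the actual patch): drop a token whose text duplicates one already emitted at the current position, i.e. a repeat arriving with posInc 0. Note a real implementation would also need to carry forward position increments correctly after a drop.

```java
import java.util.ArrayList;
import java.util.List;

public class DedupSketch {
    record Tok(String text, int posInc) {}

    // Hypothetical sketch: track the token texts seen at the current
    // position; a posInc > 0 starts a new position, and a token whose
    // text was already emitted at this position is skipped.
    static List<Tok> dedupSamePosition(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        List<String> atCurrentPos = new ArrayList<>();
        for (Tok t : in) {
            if (t.posInc() > 0) atCurrentPos.clear(); // new position
            if (atCurrentPos.contains(t.text())) continue; // duplicate, drop
            atCurrentPos.add(t.text());
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        // "ab ab ab" all at one position collapses back to a single "ab".
        List<Tok> stream = List.of(new Tok("ab", 1), new Tok("ab", 0), new Tok("ab", 0));
        System.out.println(dedupSamePosition(stream)); // [Tok[text=ab, posInc=1]]
    }
}
```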

          Robert Muir added a comment -

          Yonik, thanks, I am glad it is not a serious bug!

          I don't think assertTokEqual really has a bug - it's written more to match lucene queries and indexes, not to exactly compare one token stream with another.

          I see your point about assertTokEqual, but it was being used to test tokenstreams... so I am glad to see it go.

          Mainly I want to make sure we don't break anything when moving this stuff to the new tokenstream API. I am sure this will involve more tests, but for now having well-defined behavior makes it a lot easier. Thanks again.

          Robert Muir added a comment -

          I don't think assertTokEqual really has a bug - it's written more to match lucene queries and indexes, not to exactly compare one token stream with another. So a singe ab token matches multiple ab tokens at the same position.

          Seriously, I think if you want to test these things, then assertQ etc. should be used (actually run queries and test the results) instead.

          But this is a bug, because aa is not the same as aa,aa(pos=0),aa(pos=0), not even for "queries and indexes".
          This is because the latter will affect the score of the search.

          I think this is right in line with what you are saying in SOLR-1674, that you have somehow lost some flexibility. This is not true:

          • these things are different, the tokenstream output is different, things such as score change
          • if you don't like this, and instead want to test queries, then do just that, instead of examining tokenstreams.
          Yonik Seeley added a comment -

          Duplicated tokens aside (that should have been tested another way), I prefer to test the intended semantics rather than the exact behavior (which often ends up testing the exact implementation and ends up being too fragile... valid changes to the implementation break the tests).

          The description of the synonym filter never specifies the order of overlapping tokens, thus the tests should accept either. But as a practical matter, I don't really care about the restrictive token ordering since it's localized to a single method that can be changed as needed later if the tests break.
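          An order-insensitive check along these lines could group token texts by position (splitting on posInc > 0) and compare the groups as sets, so "aa/a" and "a/aa" at one position match while any difference in content or positions still fails. A hypothetical sketch, not the actual patch:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class OverlapCompareSketch {
    // Group token texts by position: a posInc > 0 opens a new group,
    // posInc 0 adds the token to the current (overlapping) group.
    static List<Set<String>> groupByPosition(List<String> texts, List<Integer> incs) {
        List<Set<String>> groups = new ArrayList<>();
        for (int i = 0; i < texts.size(); i++) {
            if (i == 0 || incs.get(i) > 0) groups.add(new HashSet<>());
            groups.get(groups.size() - 1).add(texts.get(i));
        }
        return groups;
    }

    // Streams are equal if each position holds the same set of tokens,
    // regardless of the order of overlapping tokens within a position.
    static boolean overlapInsensitiveEquals(List<String> aTexts, List<Integer> aIncs,
                                            List<String> bTexts, List<Integer> bIncs) {
        return groupByPosition(aTexts, aIncs).equals(groupByPosition(bTexts, bIncs));
    }

    public static void main(String[] args) {
        // "aa/a" vs "a/aa": same position, different order -> equal here,
        // though a strict stream comparison would reject it.
        boolean eq = overlapInsensitiveEquals(
            List.of("aa", "a"), List.of(1, 0),
            List.of("a", "aa"), List.of(1, 0));
        System.out.println(eq); // true
    }
}
```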

          Robert Muir added a comment -

          Duplicated tokens aside (that should have been tested another way), I prefer to test the intended semantics rather than the exact behavior

          Then it would be better to use actual queries and such to test this filter, as (some of) the WordDelimiterFilter tests do with assertQ and friends.

          But I think the order of tokens in the stream is still something I would like to preserve when migrating to the new tokenstream API. Maybe it doesn't seem important, but it would still be a change in behavior (it would probably only break you if you ran PositionFilter afterwards, or something like that).

          I guess in my opinion it is overall better for the tests to be overly strict; if in the future we make a valid change to the implementation that breaks a test, we can discuss it during said change, and people can comment on whether the behavior was actually important. For example, the aa/a versus a/aa case I would probably say is not a big deal, but the aa versus aa/aa/aa thing to me is a big deal.

          The alternative, being overly lax, presents the possibility of introducing incorrect behavior without being caught, and I think this is very dangerous.

          Steve Rowe added a comment -

          New version of the patch, introducing the possibility to test token streams that are allowed to vary the order of their overlapping tokens. All tests pass for me.

          Yonik wrote:

          The description of the synonym filter never specifies the order of overlapping tokens, thus the tests should accept either.

          Yonik, can you check that the attached patch does what you are suggesting?

          Robert wrote:

          I guess in my opinion, overall its better for the tests to be overly strict, and if in the future we make a valid change to the implementation that breaks a test, we can discuss it during said change, and people can comment on whether this behavior was actually important: for example the aa/a versus a/aa i would probably say not a big deal, but the aa versus aa/aa/aa thing to me is a big deal.

          The alternative, being overly lax, presents the possibility of introducing incorrect behavior without being caught, and I think this is very dangerous.

          Robert, do you think that the attached patch crosses over into overly lax land?

          Robert Muir added a comment -

          Steven, I don't have a problem with your patch (I do not wish to be in the way of anyone trying to work on SynonymFilter).

          But I want to explain some of where I was coming from.

          The main reason I got myself into this mess was to try to add WordNet support to Solr. However, this is currently not possible without duplicating a lot of code.
          We need to be really careful about allowing any order; it does matter in some situations.
          For example, Lucene's SynonymFilter (with WordNet support) has an option to limit the number of expansions (so it's like a top-N synonym expansion).
          Solr doesn't currently have this, so it's N/A for now, but it's an example of where the order suddenly becomes important.

          Only slightly related: we added some improvements to this assertion in Lucene recently and found a lot of bugs - better checking for clearAttribute() and end().
          At some point I would like to port these test improvements over to Solr, too.

          Steve Rowe added a comment -

          We need to be really careful about allowing any order, it does matter in some situations.

          I left in place the existing test method, which requires the specified order.

          only slightly related: we added some improvements to this assertion in lucene recently and found a lot of bugs, better checking for clearAttribute() and end()
          at some point I would like to port these test improvements over to Solr, too.

          +1

          Robert Muir added a comment -

          I left in place the existing test method, which requires the specified order.

          Is it possible to only expose the 'unsorted' one to the synonyms test (such as in the synonyms test file itself, rather than the base token stream test case)?

          I can't think of another situation where it would make sense; it is more likely to be abused instead.

          Steve Rowe added a comment -

          is it possible to expose the 'unsorted' one to synonyms test (such as in the synonyms test file itself, rather than base token stream test case?)

          I wrote it the way I did to reduce code duplication/maintenance effort, e.g. the upcoming sync with Lucene's changes in this area.

          I'm thinking it should be possible to refactor the current method to serve both the sorted case and an external unsorted case. I'll work on it.

          Yonik Seeley added a comment -

          is it possible to only expose the 'unsorted' one to synonyms test (such as in the synonyms test file itself, rather than base token stream test case?)

          Order of overlapping tokens is unimportant in every TokenFilter used in Solr that I know about. Order-sensitivity is the exception, no?

          Robert Muir added a comment -

          Order of overlapping tokens is unimportant in every TokenFilter used in Solr that I know about. Order-sensitivity is the exception, no?

          I guess all along my problem is that tokenstreams are ordered by definition. If this order does not matter, a test that uses actual queries would make more sense.

          The problem was that the test constructions previously used by this filter were used in other places where they really shouldn't have been, and the laxness hid real bugs (such as this very issue itself!).
          This is all I am trying to avoid. There is nothing wrong with Steven's patch/test construction; I am just trying to err on the side of caution.

          Yonik Seeley added a comment -

          I guess all along my problem is that tokenstreams are ordered by definition.

          Not at the semantic level (for overlapping tokens). For instance, it would be incorrect for someone to rely on a specific ordering for overlapping tokens - order has never been guaranteed by any TokenFilters that do produce overlapping tokens, and in fact the ordering for some has changed in the past as the implementation changed.

          Robert Muir added a comment -

          Not at the semantic level (for overlapping tokens).

          Another way to look at it is that a tokenstream is just a sequence of tokens, and posInc is just another attribute.

          Your description of the semantics makes sense in terms of how it is used by the indexer, but the order of these tokens can matter if someone uses a custom tokenfilter; it might matter for some custom attributes, and it might matter for a different consumer - it's different behavior. I have made an effort to preserve all the behavior of all these tokenstreams when converting to the new API. I really don't want to break anything.

          Lance Norskog added a comment -

          Not at the semantic level (for overlapping tokens). For instance, it would be incorrect for someone to rely on a specific ordering for overlapping tokens - order has never been guaranteed by any TokenFilters that do produce overlapping tokens, and in fact the ordering for some has changed in the past as the implementation changed.

          Is this part of the TokenFilter contract documented somewhere?

          Steve Rowe added a comment -

          I like Yonik's "overlapping tokens" term better than the term I used in my patch ("collocated tokens" - ambiguous, since this is used to refer to phrases rather than shared-position tokens), so I've replaced "collocated" with "overlapping" in variable names in this version of the patch.

          Also, I had forgotten to test for the case where more token stream tokens are present at the same position than are expected. This is now fixed.

          I have not made any other changes - in particular, I have not moved the overlapping-token-order-insensitive test capability out of BaseTokenTestCase.

          All tests pass for me.


            People

            • Assignee:
              Yonik Seeley
              Reporter:
              Robert Muir
            • Votes:
              0
              Watchers:
              1

              Dates

              • Created:
                Updated:
