Solr
  1. Solr
  2. SOLR-1362

WordDelimiterFilter position increment bug

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      WordDelimiterFilter sometimes assigns high position increment values, which inhibits phrase matches.

      If this is a feature and not a bug please change the issue type, and I will change the patch to propose this as an option...

      1. SOLR-1362_tests.txt
        5 kB
        Robert Muir
      2. SOLR-1362.patch
        7 kB
        Robert Muir
      3. SOLR-1362.patch
        6 kB
        Yonik Seeley
      4. SOLR-1362.patch
        6 kB
        Yonik Seeley
      5. SOLR-1362.patch
        3 kB
        Robert Muir

        Activity

        Hide
        Yonik Seeley added a comment -

        Hmmm, at first blush, it seems like we're just trying to preserve the positioning of the original tokens... so
        "foo !@# bar" will index to "foo (blank) bar"

        Is that the issue?

        Show
        Yonik Seeley added a comment - Hmmm, at first blush, it seems like we're just trying to preserve the positioning of the original tokens... so "foo !@# bar" will index to "foo (blank) bar" Is that the issue?
        Hide
        Robert Muir added a comment -

        yonik, maybe: i am unable to tell from docs/tests if it is a bug or not. from the code (especially the variable name) it looks suspicious.

        but this is what i mean, i guess it could be desirable for some folks? So I could make it another option...?

        Show
        Robert Muir added a comment - yonik, maybe: i am unable to tell from docs/tests if it is a bug or not. from the code (especially the variable name) it looks suspicious. but this is what i mean, i guess it could be desirable for some folks? So I could make it another option...?
        Hide
        Robert Muir added a comment -

        fyi this line of code was changed from = to += in SOLR-14, which in general appears to be an unrelated issue

        there is one comment about wdf doing the "wrong thing" with positions, but i cannot tell if it was intentional or not.

        Show
        Robert Muir added a comment - fyi this line of code was changed from = to += in SOLR-14 , which in general appears to be an unrelated issue there is one comment about wdf doing the "wrong thing" with positions, but i cannot tell if it was intentional or not.
        Hide
        Yonik Seeley added a comment -

        I could see perhaps wanting WDF to not create new gaps, but it should normally preserve existing gaps, right? In which case, a patch would be more complex than conditionally changing "+=" to "="

        Show
        Yonik Seeley added a comment - I could see perhaps wanting WDF to not create new gaps, but it should normally preserve existing gaps, right? In which case, a patch would be more complex than conditionally changing "+=" to "="
        Hide
        Robert Muir added a comment -

        Yonik, in this case I think existing gaps would be preserved with the =
        the question is what should the behavior be for tokens that are all delimiters, currently these are discarded and posInc is incremented for future tokens.

        with the current behavior, if you have "LUCENE / SOLR" (all with posInc=1), this becomes LUCENE (posInc=1), SOLR (posInc=2)
        if you change it to =, if you have "LUCENE / SOLR" (pretend SOLR has posInc=3), the posInc=3 would still be preserved, it would just not become 4.

        its clear to me looking at history its been like this for a long time, so maybe I am incorrect to categorize it as a bug?
        But I found this behavior to be a little unexpected, especially for phrase queries, and given the way the code reads, I wasn't certain if it was intentional.

        Show
        Robert Muir added a comment - Yonik, in this case I think existing gaps would be preserved with the = the question is what should the behavior be for tokens that are all delimiters, currently these are discarded and posInc is incremented for future tokens. with the current behavior, if you have "LUCENE / SOLR" (all with posInc=1), this becomes LUCENE (posInc=1), SOLR (posInc=2) if you change it to =, if you have "LUCENE / SOLR" (pretend SOLR has posInc=3), the posInc=3 would still be preserved, it would just not become 4. its clear to me looking at history its been like this for a long time, so maybe I am incorrect to categorize it as a bug? But I found this behavior to be a little unexpected, especially for phrase queries, and given the way the code reads, I wasn't certain if it was intentional.
        Hide
        Yonik Seeley added a comment -

        Yonik, in this case I think existing gaps would be preserved with the =

        What if the big position increment was on a token that was all delimiters?

        I agree that it makes more sense for "LUCENE / SOLR" to be translated to LUCENE SOLR without a gap though (provided that there are no gaps to start with).

        Should the rule be, subtract 1 from the cumulative position increment if the increment of the current token being added is >=1 ?

        Show
        Yonik Seeley added a comment - Yonik, in this case I think existing gaps would be preserved with the = What if the big position increment was on a token that was all delimiters? I agree that it makes more sense for "LUCENE / SOLR" to be translated to LUCENE SOLR without a gap though (provided that there are no gaps to start with). Should the rule be, subtract 1 from the cumulative position increment if the increment of the current token being added is >=1 ?
        Hide
        Robert Muir added a comment -

        ah, i see your point...

        sounds right to me. i can reformulate the patch/tests in this direction.

        I assume it should be as an option for back compat?

        Show
        Robert Muir added a comment - ah, i see your point... sounds right to me. i can reformulate the patch/tests in this direction. I assume it should be as an option for back compat?
        Hide
        Yonik Seeley added a comment -

        I assume it should be as an option for back compat?

        I guess so, yes... there may be configurations where it makes sense.

        Show
        Yonik Seeley added a comment - I assume it should be as an option for back compat? I guess so, yes... there may be configurations where it makes sense.
        Hide
        Robert Muir added a comment -

        I started working on a patch, but found the existing behavior to be more strange than I originally thought.

        there are some bugs in the existing behavior as well, completely separate but along the same lines of this issue.

        So here are some tests, let me know what you think...

        Show
        Robert Muir added a comment - I started working on a patch, but found the existing behavior to be more strange than I originally thought. there are some bugs in the existing behavior as well, completely separate but along the same lines of this issue. So here are some tests, let me know what you think...
        Hide
        Yonik Seeley added a comment - - edited

        Thanks Robert... looking into it.

        Show
        Yonik Seeley added a comment - - edited Thanks Robert... looking into it.
        Hide
        Yonik Seeley added a comment -

        OK, here's a patch that increases the consistency.

        From here I think we should subtract 1 for every skipped token that had a position increment of 1. It's not possible to fix these bugs and keep back compatibility, so I don't think we need a config option for this.

        Show
        Yonik Seeley added a comment - OK, here's a patch that increases the consistency. From here I think we should subtract 1 for every skipped token that had a position increment of 1. It's not possible to fix these bugs and keep back compatibility, so I don't think we need a config option for this.
        Hide
        Robert Muir added a comment -

        Yonik, thanks! I will work on the skipped token subtraction tonight, but your patch has restored order

        Show
        Robert Muir added a comment - Yonik, thanks! I will work on the skipped token subtraction tonight, but your patch has restored order
        Hide
        Yonik Seeley added a comment -

        I had implemented the "remove normal posIncr" this morning - just got around to adding a test though. Seems to work - can you see if this matches your expectations?

        I think about it like this: position increments serve two purposes... to signify the normal case of tokens being adjacent to each other, and to separate groups of tokens. In the former case, it makes sense to completely consume the delimiter and keep the resulting tokens next to eachother. In the later case, we want to preserve the logical separation.

        Show
        Yonik Seeley added a comment - I had implemented the "remove normal posIncr" this morning - just got around to adding a test though. Seems to work - can you see if this matches your expectations? I think about it like this: position increments serve two purposes... to signify the normal case of tokens being adjacent to each other, and to separate groups of tokens. In the former case, it makes sense to completely consume the delimiter and keep the resulting tokens next to eachother. In the later case, we want to preserve the logical separation.
        Hide
        Robert Muir added a comment -

        I had implemented the "remove normal posIncr" this morning - just got around to adding a test though. Seems to work - can you see if this matches your expectations?

        I tested some with this patch some and I like this behavior.

        I think about it like this: position increments serve two purposes... to signify the normal case of tokens being adjacent to each other, and to separate groups of tokens. In the former case, it makes sense to completely consume the delimiter and keep the resulting tokens next to eachother. In the later case, we want to preserve the logical separation.

        Yes, I think this fixes the behavior for both purposes.

        Show
        Robert Muir added a comment - I had implemented the "remove normal posIncr" this morning - just got around to adding a test though. Seems to work - can you see if this matches your expectations? I tested some with this patch some and I like this behavior. I think about it like this: position increments serve two purposes... to signify the normal case of tokens being adjacent to each other, and to separate groups of tokens. In the former case, it makes sense to completely consume the delimiter and keep the resulting tokens next to eachother. In the later case, we want to preserve the logical separation. Yes, I think this fixes the behavior for both purposes.
        Hide
        Robert Muir added a comment -

        actually one last thing Yonik, at the beginning of the processing loop (now i am just nitpicking)

             //skip protected tokens
              if (protWords != null && protWords.contains(termBuffer, 0, len)) {
                return t;
              }
        
              int posInc = t.getPositionIncrement();
              origPosIncrement += posInc;
              ...
        

        I should have written testcase, but don't you think if you have a LUCENE / SOLR where "/" has a huge gap, and SOLR is in 'protWords' that this might result in strange behavior?

        Show
        Robert Muir added a comment - actually one last thing Yonik, at the beginning of the processing loop (now i am just nitpicking) //skip protected tokens if (protWords != null && protWords.contains(termBuffer, 0, len)) { return t; } int posInc = t.getPositionIncrement(); origPosIncrement += posInc; ... I should have written testcase, but don't you think if you have a LUCENE / SOLR where "/" has a huge gap, and SOLR is in 'protWords' that this might result in strange behavior?
        Hide
        Robert Muir added a comment -

        patch that moves the protWords check below the posInc calculation, and sets it.

        Show
        Robert Muir added a comment - patch that moves the protWords check below the posInc calculation, and sets it.
        Hide
        Yonik Seeley added a comment -

        Committed, thanks Robert!

        Show
        Yonik Seeley added a comment - Committed, thanks Robert!
        Hide
        Grant Ingersoll added a comment -

        Bulk close for Solr 1.4

        Show
        Grant Ingersoll added a comment - Bulk close for Solr 1.4

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development