Solr / SOLR-1852

enablePositionIncrements="true" can cause searches to fail when they are parsed as phrase queries

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.4.1
    • Component/s: None
    • Labels:
      None

      Description

      Symptom: when searching for a string such as a domain name containing a '.', the Solr 1.4 analysis page indicates that I will get a match, but when I run the search either in the client or directly in Solr, the search fails.
      test string: Identi.ca

      queries that fail: IdentiCa, Identi.ca, Identi-ca

      query that matches: Identi ca

      schema in use is:
      http://drupalcode.org/viewvc/drupal/contributions/modules/apachesolr/schema.xml?revision=1.1.2.1.2.34&content-type=text%2Fplain&view=co&pathrev=DRUPAL-6--1

      Screen shots:

      analysis: http://img.skitch.com/20100327-nt1uc1ctykgny28n8bgu99h923.png
      dismax search: http://img.skitch.com/20100327-byiduuiry78caka7q5smsw7fp.png
      dismax search: http://img.skitch.com/20100327-gckm8uhjx3t7px31ygfqc2ugdq.png
      standard search: http://img.skitch.com/20100327-usqyqju1d12ymcpb2cfbtdwyh.png

      Whether or not the bug appears is determined by the surrounding text:

      "would be great to have support for Identi.ca on the follow block"

      fails to match "Identi.ca", but putting the content on its own or in another sentence:

      "Support Identi.ca"

      the search matches. Testing suggests the word "for" is the problem, and it looks like the bug occurs when a stop word precedes a word that is split up by the word delimiter filter.

      Setting enablePositionIncrements="false" in the stop filter and reindexing causes the searches to match.

      According to Mark Miller in #solr, this bug appears to be fixed already in Solr trunk, due either to the upgraded Lucene or to changes to the WordDelimiterFactory.
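The effect of the enablePositionIncrements setting can be sketched with a toy model. This is not Solr code: `stop_filter`, `word_delimiter_14` and `gap` are hypothetical helpers, the stop list is assumed, and the word delimiter model assumes that the Solr 1.4 filter copies the incoming position increment onto every sub-word, which is the behaviour Robert Muir describes further down in this issue. Catenated tokens are left out for brevity.

```python
# Toy model only: these helpers are hypothetical and do not call Solr or Lucene.
STOP_WORDS = {"for"}  # assumed stop list for this sketch


def stop_filter(tokens, enable_position_increments):
    """Drop stop words; optionally add their increments to the next kept token."""
    out, skipped = [], 0
    for term, inc in tokens:
        if term in STOP_WORDS:
            if enable_position_increments:
                skipped += inc
            continue
        out.append((term, inc + skipped))
        skipped = 0
    return out


def word_delimiter_14(tokens):
    """Assumed 1.4 behaviour: every sub-word inherits the incoming increment."""
    out = []
    for term, inc in tokens:
        out.extend((part, inc) for part in term.split("."))
    return out


def gap(tokens, first, second):
    """Distance in positions between two terms of an analyzed token list."""
    pos, where = -1, {}
    for term, inc in tokens:
        pos += inc
        where[term] = pos
    return where[second] - where[first]


text = [("for", 1), ("identi.ca", 1)]
with_gaps = word_delimiter_14(stop_filter(text, True))
without_gaps = word_delimiter_14(stop_filter(text, False))

print(gap(with_gaps, "identi", "ca"))     # 2: "ca" is not adjacent to "identi"
print(gap(without_gaps, "identi", "ca"))  # 1: adjacent, so a phrase query can match
```

Under these assumptions, turning the flag off hides the problem because the stop filter no longer hands an increment greater than 1 to the word delimiter filter.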

      1. SOLR-1852_solr14branch.patch
        54 kB
        Robert Muir
      2. SOLR-1852_testcase.patch
        2 kB
        Robert Muir
      3. SOLR-1852.patch
        51 kB
        Peter Wolanin

        Activity

        Peter Wolanin added a comment - - edited

        This patch was created by Mark Miller - it's a back port of Solr trunk code plus a tweak to let 1.4 compile.

        With this updated WordDelimiterFilter, the bug seems to be fixed once I reindex.

        As for reproducing the bug's symptoms, it looks as though Identi.ca is treated as a phrase query, as if I had quoted it like "Identi ca". That phrase search also fails. I had expected that Identi.ca would be the same as Identi ca (i.e. 2 separate tokens, not a phrase).

        Peter Wolanin added a comment -

        The changes in the patch originate at SOLR-1706 and SOLR-1657; however, I don't think it's actually the same bug that SOLR-1706 intended to fix, since in the admin analyzer interface the generated tokens look correct.

        Robert Muir added a comment -

        > The changes in the patch originate at SOLR-1706 and SOLR-1657; however, I don't think it's actually the same bug that SOLR-1706 intended to fix, since in the admin analyzer interface the generated tokens look correct.

        Yeah, I don't like the situation at all: it's not obvious to me at a glance how the trunk impl fixes your problem, nor how this changed behavior slipped past the random tests on SOLR-1710.

        Robert Muir added a comment -

        OK, so your bug relates somehow to how the accumulated position increment gap is handled.

        This is how your stopword fits into the situation: somehow the new code is handling it "better" for your case, but perhaps it's wrong.

        There are quite a few tests in TestWordDelimiter, which it passes, but I'll spend some time tonight verifying its correctness before we declare success...

        Robert Muir added a comment -

        Attached is a test case demonstrating the bug.

        The problem is that if you have, for example "the lucene.solr", where "the" is a stopword, the Solr 1.4 WordDelimiter bumps the position increment of both "lucene" and "solr" tokens:

        • lucene (posInc=2)
        • solr (posInc=2)
        • lucenesolr (posInc=0)

        Instead it should look like:

        • lucene (posInc=2)
        • solr (posInc=1)
        • lucenesolr (posInc=0)

        In my opinion the behavior of trunk is correct, and this is a bug.
        But I don't know how to fix just Solr 1.4's WDF in a better way than dropping in the entire rewritten WDF...
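To see why the extra increment breaks the search, here is a minimal sketch that turns the two token lists above into absolute positions and checks an exact phrase against them. It is not Lucene code: `absolute_positions` and `phrase_matches` are hypothetical helpers, and the check assumes a phrase query with no slop, where each query term must sit exactly one position after the previous one.

```python
# Toy model only: hypothetical helpers, not the Lucene phrase query implementation.

def absolute_positions(tokens):
    """Convert (term, posInc) pairs into (term, position) pairs."""
    pos, out = -1, []
    for term, inc in tokens:
        pos += inc
        out.append((term, pos))
    return out


def phrase_matches(indexed, phrase_terms):
    """Exact phrase check: the terms must occupy consecutive positions."""
    positions = {}
    for term, pos in indexed:
        positions.setdefault(term, set()).add(pos)
    starts = positions.get(phrase_terms[0], set())
    return any(
        all(start + i in positions.get(term, set())
            for i, term in enumerate(phrase_terms))
        for start in starts
    )


# Token lists taken from the comment above ("the lucene.solr", "the" removed).
solr_14 = [("lucene", 2), ("solr", 2), ("lucenesolr", 0)]
trunk = [("lucene", 2), ("solr", 1), ("lucenesolr", 0)]

print(absolute_positions(solr_14))  # [('lucene', 1), ('solr', 3), ('lucenesolr', 3)]
print(absolute_positions(trunk))    # [('lucene', 1), ('solr', 2), ('lucenesolr', 2)]
print(phrase_matches(absolute_positions(solr_14), ["lucene", "solr"]))  # False
print(phrase_matches(absolute_positions(trunk), ["lucene", "solr"]))    # True
```

In this model the 1.4 output leaves an empty position between "lucene" and "solr", so a query that is parsed as the phrase "lucene solr" cannot match.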

        Robert Muir added a comment -

        I'm afraid of WDF, but I don't think I am the only one, and I think it would be good to fix this bug.

        If no one objects, I'd like to commit these patches (testcase and backport the trunk filter) to the 1.5 branch in a few days.

        Peter Wolanin added a comment -

        I'm confused by that comment - I thought this code is already in 1.5/trunk and the issue is backporting to the 1.4 branch?

        Robert Muir added a comment -

        Peter, it is... but admittedly it has not been in trunk for very long, and WDF is pretty complex.

        It's a bit scary to backport a rewrite of it for this reason, but at the same time, we've got this bug and the other config bugs found in SOLR-1706, so I think it's the right thing to do...

        Robert Muir added a comment -

        Committed the test to trunk: revision 930262.

        Peter Wolanin added a comment -

        Now that this has been in trunk longer, do you feel any more confident about a back port?

        Robert Muir added a comment -

        > Now that this has been in trunk longer, do you feel any more confident about a back port?

        I feel more confident about the new implementation of WordDelimiterFilter, yes.

        I suppose the question here is whether the 1.5 branch is dead or not (no one seems to commit to it).

        Robert Muir added a comment -

        Also, Mark mentioned to me he had concerns about 'index back-compat'.

        Obviously, if we fix the bug, we 'break' this in the sense that you now index with correct positions...

        Peter Wolanin added a comment -

        I'm thinking about backporting to 1.4 - not sure what's happening with 1.5.

        Yes, you'd have to re-index if we have to backport to 1.4, but I assume that's only going to affect documents that would currently have broken searches?

        Robert Muir added a comment -

        I am willing to do the backport here if people want this in 1.4.1, just let me know.

        Peter Wolanin added a comment -


        Yes, I'd propose to have this in 1.4.1 since it's a pretty serious bug in the places where it manifests.

        Robert Muir added a comment -

        Here is the patch for Solr 1.4. This also fixes SOLR-1706.

        Robert Muir added a comment -

        Committed revision 950711. Thanks Peter!

        Mark Bennett added a comment -

        I realize this is closed, but I found a workaround for those who are still working with a pre-fix version.

        Just put the stopwords filter after the Word Delimiter filter. That worked for us without impacting much else, until we can get over to the new version.
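A rough sketch of why the filter order matters, using the same simplified model as the "the lucene.solr" example earlier in this issue. Nothing here is Solr code: `stop_filter` and `word_delimiter_14` are hypothetical helpers, the stop list is assumed, and the word delimiter model assumes the pre-fix filter copies the incoming increment onto every sub-word and emits the catenated token with an increment of 0.

```python
# Toy model only: hypothetical helpers that imitate the two filters.
STOP_WORDS = {"the"}  # assumed stop list for this sketch


def stop_filter(tokens):
    """Stop filter with position increments enabled: removed words leave a gap."""
    out, skipped = [], 0
    for term, inc in tokens:
        if term in STOP_WORDS:
            skipped += inc
            continue
        out.append((term, inc + skipped))
        skipped = 0
    return out


def word_delimiter_14(tokens):
    """Assumed pre-fix behaviour: each sub-word inherits the incoming increment."""
    out = []
    for term, inc in tokens:
        parts = term.split(".")
        if len(parts) == 1:
            out.append((term, inc))
            continue
        out.extend((part, inc) for part in parts)
        out.append(("".join(parts), 0))  # catenated form, stacked on the last part
    return out


text = [("the", 1), ("lucene.solr", 1)]

stop_then_wdf = word_delimiter_14(stop_filter(text))
wdf_then_stop = stop_filter(word_delimiter_14(text))

print(stop_then_wdf)  # [('lucene', 2), ('solr', 2), ('lucenesolr', 0)]
print(wdf_then_stop)  # [('lucene', 2), ('solr', 1), ('lucenesolr', 0)]
```

With the stop filter last, the word delimiter only ever sees increments of 1 in this model, so the gap left by the stop word lands on the first sub-word alone.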


          People

          • Assignee:
            Robert Muir
            Reporter:
            Peter Wolanin