Lucene - Core / LUCENE-6689

Odd analysis problem with WDF, appears to be triggered by preceding analysis components

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.8
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This problem shows up for me in Solr, but I believe the issue is down at the Lucene level, so I've opened the issue in the LUCENE project. We can move it if necessary.

      I've boiled the problem down to this minimum Solr fieldType:

          <fieldType name="testType" class="solr.TextField"
                     sortMissingLast="true" positionIncrementGap="100">
            <analyzer type="index">
              <tokenizer class="org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory"
                         rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
              <filter class="solr.PatternReplaceFilterFactory"
                pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
                replacement="$2"
              />
              <filter class="solr.WordDelimiterFilterFactory"
                splitOnCaseChange="1"
                splitOnNumerics="1"
                stemEnglishPossessive="1"
                generateWordParts="1"
                generateNumberParts="1"
                catenateWords="1"
                catenateNumbers="1"
                catenateAll="0"
                preserveOriginal="1"
              />
            </analyzer>
            <analyzer type="query">
              <tokenizer class="org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory"
                         rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
              <filter class="solr.PatternReplaceFilterFactory"
                pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
                replacement="$2"
              />
              <filter class="solr.WordDelimiterFilterFactory"
                splitOnCaseChange="1"
                splitOnNumerics="1"
                stemEnglishPossessive="1"
                generateWordParts="1"
                generateNumberParts="1"
                catenateWords="0"
                catenateNumbers="0"
                catenateAll="0"
                preserveOriginal="0"
              />
            </analyzer>
          </fieldType>
      

      On Solr 4.7, if this type is given the input "aaa-bbb: ccc", index analysis puts aaa at term position 1 and bbb at term position 2, which seems perfectly reasonable to me. In Solr 4.9, both terms end up at position 2. As a result, phrase queries that used to work now return zero hits, even though the exact text of the phrase is present in the original documents, which do match on 4.7.

      If the custom rbbi (which is included unmodified from the lucene icu analysis source code) is not used, then the problem doesn't happen, because the punctuation doesn't make it to the PRF. If the PatternReplaceFilterFactory is not present, then the problem doesn't happen.

      I can work around the problem by setting luceneMatchVersion to 4.7, but I think the behavior is a bug, and I would rather not continue to use 4.7 analysis when I upgrade to 5.x, which I hope to do soon.
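
      For reference, a minimal sketch of that global workaround, assuming the setting is made through the usual luceneMatchVersion element in solrconfig.xml (the placement and the exact value format are assumptions, not copied from a real config):

          <!-- solrconfig.xml (assumed location): every analysis component
               without its own override uses 4.7 behavior -->
          <luceneMatchVersion>4.7</luceneMatchVersion>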

      Whether luceneMatchVersion is LUCENE_47 or LUCENE_4_9, query analysis puts aaa at term position 1 and bbb at term position 2.

          Activity

          elyograg Shawn Heisey added a comment - edited

          I chose the latter workaround – removing PRFF anywhere WDFF is also used.

          elyograg Shawn Heisey added a comment -

          Thinking about this in more detail, another workaround is to remove PRFF entirely, at least from analysis chains where it is followed by WDFF. WDF appears to remove that punctuation anyway, and it looks like it does it correctly for my purposes.

          I still believe that the behavior I'm seeing is a bug, even if there are at least two viable workarounds that will work for my situation. Use cases may exist (more valid than mine) where the user needs pattern replace filter right before word delimiter filter.
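
          A sketch of this workaround, assuming the index analyzer from the description with the PatternReplaceFilterFactory step simply dropped. The fieldType name "testTypeNoPrf" is hypothetical, and only the index analyzer is shown; the query analyzer would be changed the same way.

              <fieldType name="testTypeNoPrf" class="solr.TextField"
                         sortMissingLast="true" positionIncrementGap="100">
                <analyzer type="index">
                  <tokenizer class="org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory"
                             rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
                  <!-- No PatternReplaceFilterFactory here: leading and trailing
                       punctuation is left for WordDelimiterFilter to handle -->
                  <filter class="solr.WordDelimiterFilterFactory"
                    splitOnCaseChange="1"
                    splitOnNumerics="1"
                    stemEnglishPossessive="1"
                    generateWordParts="1"
                    generateNumberParts="1"
                    catenateWords="1"
                    catenateNumbers="1"
                    catenateAll="0"
                    preserveOriginal="1"
                  />
                </analyzer>
              </fieldType>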

          elyograg Shawn Heisey added a comment -

          I have simplified the analysis chain, removing the ICU tokenizer and replacing it with the whitespace tokenizer. The root problem appears to be an interaction between PatternReplaceFilter and WordDelimiterFilter.

          With the following Solr analysis chain, an indexed value of "aaa-bbb: ccc" will not be found by a phrase search of "aaa bbb" because the positions on the two query terms don't match what's in the index. The positions go wrong on the WordDelimiterFilter step.

              <fieldType name="genText2" class="solr.TextField" sortMissingLast="true" positionIncrementGap="100">
                <analyzer>
                  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                  <filter class="solr.PatternReplaceFilterFactory"
                    pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
                    replacement="$2"
                  />
                  <filter class="solr.WordDelimiterFilterFactory"
                    splitOnCaseChange="1"
                    splitOnNumerics="1"
                    stemEnglishPossessive="1"
                    generateWordParts="1"
                    generateNumberParts="1"
                    catenateWords="1"
                    catenateNumbers="1"
                    catenateAll="0"
                    preserveOriginal="1"
                  />
                </analyzer>
              </fieldType>
          

          If I remove PRFF from the above chain, the problem goes away. This filter is in the chain so that leading and trailing punctuation are removed from all terms, leaving punctuation inside the term for WDF to handle.

          An additional problem with the analysis quoted above is that the "aaabbb" term is indexed at position 2. I believe it should be at position 1. This problem is also fixed by removing PRFF.

          elyograg Shawn Heisey added a comment -

          I have just found a better workaround: The luceneMatchVersion can be specified on each analysis component, so I can apply it only to the WordDelimiterFilterFactory on the index analysis.
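
          A sketch of what this could look like, assuming the per-component attribute is spelled luceneMatchVersion and accepts the same value as the global setting. Only the WordDelimiterFilterFactory element of the index analyzer is shown; its other attributes are the ones from the description.

              <!-- Index-time WDF only: use the 4.7 behavior (assumed attribute
                   name and value format); the rest of the chain is unchanged -->
              <filter class="solr.WordDelimiterFilterFactory"
                luceneMatchVersion="4.7"
                splitOnCaseChange="1"
                splitOnNumerics="1"
                stemEnglishPossessive="1"
                generateWordParts="1"
                generateNumberParts="1"
                catenateWords="1"
                catenateNumbers="1"
                catenateAll="0"
                preserveOriginal="1"
              />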

          I hope this problem will still be fixed.

          elyograg Shawn Heisey added a comment -

          I have just confirmed that setting luceneMatchVersion to 4.7 when running the previously described test on 5.2.1 will fix the problem. This means I have a workaround, but it's not one that I'm really very happy with.

          elyograg Shawn Heisey added a comment -

          I can work around the specific queries that caused the problem if I make index and query WDF analysis exactly the same ... but there's a problem even then.

          As a test, I entirely removed the query analysis above and removed the "type" attribute from the index analysis so it applies to both. I put this fieldType into Solr 5.2.1 and went to the analysis screen.

          A phrase search for "aaa bbb" when the indexed value was "aaa-bbb: ccc" does not match, because the positions are wrong. I believe that it should match. A user would most likely expect it to match.

          elyograg Shawn Heisey added a comment -

          LUCENE-5111 seems to contain the commit that causes this behavior.


            People

            • Assignee:
              Unassigned
              Reporter:
              elyograg Shawn Heisey
            • Votes:
              0
              Watchers:
              2
