Solr
  1. Solr
  2. SOLR-14

Add the ability to preserve the original term when using WordDelimiterFilter

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3
    • Component/s: search
    • Labels:
      None

      Description

      When doing prefix searching, you need to hang on to the original term othewise you'll miss many matches you should be making.

      Data: ABC-12345
      WordDelimiterFitler may change this into
      ABC 12345 ABC12345

      A user may enter a search such as
      ABC-123*
      Which will fail to find a match given the above scenario.

      The attached patch will allow the use of the "preserveOriginal" option to WordDelimiterFilter and will analyse as
      ABC 12345 ABC12345 ABC-12345
      in which case we will get a postive match.

      1. SOLR-14.patch
        12 kB
        Yonik Seeley
      2. SOLR-14.patch
        10 kB
        Geoffrey Young
      3. SOLR-14.patch
        9 kB
        Ankur Madnani
      4. SOLR-14.patch
        10 kB
        Geoffrey Young
      5. TokenizerFactory.java
        0.9 kB
        Richard "Trey" Hyde
      6. WordDelimiterFilter.patch
        5 kB
        Richard "Trey" Hyde
      7. WordDelimiterFilter.patch
        2 kB
        Richard "Trey" Hyde

        Activity

        Hide
        Yonik Seeley added a comment -

        Thanks for the patch Trey!

        Can you give an example with the resulting token positions (or positionIncrements?)

        Also, is there an easy way to prevent duplicate tokens from being produced (the preserveOriginal version will often be identical to catenateWords or catenateNumbers, right?)

        Show
        Yonik Seeley added a comment - Thanks for the patch Trey! Can you give an example with the resulting token positions (or positionIncrements?) Also, is there an easy way to prevent duplicate tokens from being produced (the preserveOriginal version will often be identical to catenateWords or catenateNumbers, right?)
        Hide
        Richard "Trey" Hyde added a comment -

        IMO, it does the wrong thing with position. I think it should have the position of the original term, I'll see if I can't figure out how to do that. It also doesn't work properly in the "single token" optimized case, another easy fix. As far as preventing duplicate tokens, an iteration though the token queue with equals should do the trick, no?

        Show
        Richard "Trey" Hyde added a comment - IMO, it does the wrong thing with position. I think it should have the position of the original term, I'll see if I can't figure out how to do that. It also doesn't work properly in the "single token" optimized case, another easy fix. As far as preventing duplicate tokens, an iteration though the token queue with equals should do the trick, no?
        Hide
        Hoss Man added a comment -

        It would probably be good to make sure we have some UnitTests of the existing WDF behavior prior to applying this patch, and then some tests that use this new feature just so it's clera how it works in various situations.

        As for duplicates: my initial thought was that this could be handled by the proposed Filter in SOLR-11... but then i realized yonik has a point: the common case is probably going to be no intra-word delimiters, so a short circut check that doesn't crete two of every token would probably be better

        Show
        Hoss Man added a comment - It would probably be good to make sure we have some UnitTests of the existing WDF behavior prior to applying this patch, and then some tests that use this new feature just so it's clera how it works in various situations. As for duplicates: my initial thought was that this could be handled by the proposed Filter in SOLR-11 ... but then i realized yonik has a point: the common case is probably going to be no intra-word delimiters, so a short circut check that doesn't crete two of every token would probably be better
        Hide
        Richard "Trey" Hyde added a comment -

        Ok, this one actually works in a few more cases.

        There is still term duping if numotk > 1 and there are no intraword delimiters in the original string.

        Show
        Richard "Trey" Hyde added a comment - Ok, this one actually works in a few more cases. There is still term duping if numotk > 1 and there are no intraword delimiters in the original string.
        Hide
        Geoffrey Young added a comment -

        this is fairly important for our ongoing implementation - can someone with appropriate karma take steps to get this into 1.3?

        thanks

        Show
        Geoffrey Young added a comment - this is fairly important for our ongoing implementation - can someone with appropriate karma take steps to get this into 1.3? thanks
        Hide
        Yonik Seeley added a comment -

        Someone would need to make an up-to-date patch that works (pref w/ some tests).

        Show
        Yonik Seeley added a comment - Someone would need to make an up-to-date patch that works (pref w/ some tests).
        Hide
        Mike Klaas added a comment -

        Note that it is very easy to use an external TokenFilter, so you could just c&p WDF into your own class and make the changes.

        (Though I'm not saying that this shouldn't make it in for 1.3)

        Show
        Mike Klaas added a comment - Note that it is very easy to use an external TokenFilter, so you could just c&p WDF into your own class and make the changes. (Though I'm not saying that this shouldn't make it in for 1.3)
        Hide
        Mike Klaas added a comment -

        Also, voting for an issue is a good way to increase its visibility

        Show
        Mike Klaas added a comment - Also, voting for an issue is a good way to increase its visibility
        Hide
        Geoffrey Young added a comment -

        ok, I've given this a shot. I'm an an open-source guy, even an ASF guy, but not a java guy, so forgive my code

        the patch should apply cleanly to current trunk. the last patch before mine still had some issues that needed to be worked through wrt term duplication. this patch should work a bit better.

        all current tests pass when adjusted to 'preserveOriginal=0' (default behavior, same as 1.2). I looked at augmenting the current tests for WDF and 'preserveOriginal=1' but it's beyond my current java abilities.

        installing the patch and running the analyzer yields stuff like this:

        foo => foo
        foo-bar => foo-bar foo bar foobar
        foo-bar baz => foo-bar foo bar foobar baz
        foo! => foo foo!

        which seems reasonable to me.

        a little shepherding would be awesome.

        thanks

        --Geoff

        Show
        Geoffrey Young added a comment - ok, I've given this a shot. I'm an an open-source guy, even an ASF guy, but not a java guy, so forgive my code the patch should apply cleanly to current trunk. the last patch before mine still had some issues that needed to be worked through wrt term duplication. this patch should work a bit better. all current tests pass when adjusted to 'preserveOriginal=0' (default behavior, same as 1.2). I looked at augmenting the current tests for WDF and 'preserveOriginal=1' but it's beyond my current java abilities. installing the patch and running the analyzer yields stuff like this: foo => foo foo-bar => foo-bar foo bar foobar foo-bar baz => foo-bar foo bar foobar baz foo! => foo foo! which seems reasonable to me. a little shepherding would be awesome. thanks --Geoff
        Hide
        Ankur Madnani added a comment -

        The junit test for preserveOriginal=1 is added to this patch.
        This patch contains the new schema.xml for test and changed TestWordDelimiterFilter.

        The patch given earlier by Geoff. changes the signature of the public constructors in the WordDelimiterFilter by adding preserveOriginal as parameter.
        May be we should create separate constructor which accepts preserveOriginal, to
        preserve Backwards Compatibility.

        Show
        Ankur Madnani added a comment - The junit test for preserveOriginal=1 is added to this patch. This patch contains the new schema.xml for test and changed TestWordDelimiterFilter. The patch given earlier by Geoff. changes the signature of the public constructors in the WordDelimiterFilter by adding preserveOriginal as parameter. May be we should create separate constructor which accepts preserveOriginal, to preserve Backwards Compatibility.
        Hide
        Geoffrey Young added a comment -

        this new patch addresses three additional cases

        o words prefixed with delimiters
        o words postfixed with delimiters
        o words that are all delimiters

        there's a special place for people who name themselves !!! and expect to be found.

        the input string

          test 404-123 $foo bar& beer !!! *foo baz's biff
        

        produces

        test 404-123 404 123 404123 foo $foo bar bar& beer !!! foo *foo baz baz's biff
        

        using options

        org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1, generateNumberParts=1, catenateWords=1, generateWordParts=1, catenateAll=1, catenateNumbers=1}
        
        Show
        Geoffrey Young added a comment - this new patch addresses three additional cases o words prefixed with delimiters o words postfixed with delimiters o words that are all delimiters there's a special place for people who name themselves !!! and expect to be found. the input string test 404-123 $foo bar& beer !!! *foo baz's biff produces test 404-123 404 123 404123 foo $foo bar bar& beer !!! foo *foo baz baz's biff using options org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1, generateNumberParts=1, catenateWords=1, generateWordParts=1, catenateAll=1, catenateNumbers=1}
        Hide
        Yonik Seeley added a comment -

        Attaching new version of patch.

        The previous version had position issues, and also had issues with certain flag combos.
        I changed the strategy by handling "preserveOriginal" outside of the main loop (anywhere there is a "break" that falls through) and then just returning the original token first and adjusting the offset of the next token to overlap.

        This should also be faster as it avoids token copying in the common case.

        Show
        Yonik Seeley added a comment - Attaching new version of patch. The previous version had position issues, and also had issues with certain flag combos. I changed the strategy by handling "preserveOriginal" outside of the main loop (anywhere there is a "break" that falls through) and then just returning the original token first and adjusting the offset of the next token to overlap. This should also be faster as it avoids token copying in the common case.
        Hide
        Geoffrey Young added a comment -

        looks good from a functional pov. the ordering of the tokens looks funny (to me) in analysis.jsp, but all the right ones are there.

        it still takes my full load roughly twice as long to index (3.5 minutes versus 1.75 minutes for a 120MB file) but the functionality is important enough to incur the cost.

        thanks muchly.

        Show
        Geoffrey Young added a comment - looks good from a functional pov. the ordering of the tokens looks funny (to me) in analysis.jsp, but all the right ones are there. it still takes my full load roughly twice as long to index (3.5 minutes versus 1.75 minutes for a 120MB file) but the functionality is important enough to incur the cost. thanks muchly.
        Hide
        Yonik Seeley added a comment -

        the ordering of the tokens looks funny (to me) in analysis.jsp

        Currently xxx-yyy is expanded to xxx yyy/xxxyyy (xxx is in the first position, yyy and xxxyyy are in the second position).
        If we also want to index the original xxx-yyy, it's a bit arbitrary if it goes in the same position as the xxx or the yyy.
        In this case I put it with xxx since the original token can then just be returned first (and retain everything about the original token such as it's payload, w/o having to clone it).

        Show
        Yonik Seeley added a comment - the ordering of the tokens looks funny (to me) in analysis.jsp Currently xxx-yyy is expanded to xxx yyy/xxxyyy (xxx is in the first position, yyy and xxxyyy are in the second position). If we also want to index the original xxx-yyy, it's a bit arbitrary if it goes in the same position as the xxx or the yyy. In this case I put it with xxx since the original token can then just be returned first (and retain everything about the original token such as it's payload, w/o having to clone it).
        Hide
        Yonik Seeley added a comment -

        I committed after adding a performance test to verify that performance doesn't noticeably degrade for preserveOriginal=false.
        Thanks everyone!

        Show
        Yonik Seeley added a comment - I committed after adding a performance test to verify that performance doesn't noticeably degrade for preserveOriginal=false. Thanks everyone!

          People

          • Assignee:
            Yonik Seeley
            Reporter:
            Richard "Trey" Hyde
          • Votes:
            2 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development