Details

    • Type: Wish
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.0
    • Component/s: search
    • Labels:
      None

      Description

      I recently noticed a situation in which my query analyzer was producing the same Token more than once, resulting in two equally boosted clauses in the resulting query. In my specific case, I was using the same synonym file for multiple fields (some stemmed, some not), and two synonyms for a word stemmed to the same root, which meant that particular word was worth twice as much as any of the other variations of the synonym – but I can imagine other situations where this might come up, both at index time and at query time, particularly when using SynonymFilter in combination with WordDelimiterFilter.

      It occurred to me that a DeDupFilter would be handy. In its simplest form it would drop any Token it gets where the startOffset, endOffset, termText, and type are all identical to the previous token and the positionIncrement is 0. A more robust implementation might support init options indicating that only certain combinations of those things should be used to determine equality (i.e. just termText, just termText with positionIncrement=0, etc...), but in that case an option might also be necessary to determine which of the Tokens should be propagated (the first or the last).
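      A minimal sketch of that simple form, assuming a tiny stand-in Token class (not Lucene's real Token API) with just the fields the rule needs:

      class Token {
        final String termText;
        final String type;
        final int startOffset, endOffset, positionIncrement;
        Token(String termText, String type, int startOffset, int endOffset, int positionIncrement) {
          this.termText = termText;
          this.type = type;
          this.startOffset = startOffset;
          this.endOffset = endOffset;
          this.positionIncrement = positionIncrement;
        }
      }

      class DeDupFilterSketch {
        private Token prev;

        /** Returns null to signal "drop this token", otherwise passes it through. */
        Token filter(Token tok) {
          boolean dup = prev != null
              && tok.positionIncrement == 0
              && tok.startOffset == prev.startOffset
              && tok.endOffset == prev.endOffset
              && tok.termText.equals(prev.termText)
              && tok.type.equals(prev.type);
          prev = tok;
          return dup ? null : tok;
        }
      }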


          Activity

          Hoss Man added a comment -

          This bug was modified as part of a bulk update using the criteria...

          • Marked ("Resolved" or "Closed") and "Fixed"
          • Had no "Fix Version" versions
          • Was listed in the CHANGES.txt for 1.1

          The Fix Version for all 38 issues found was set to 1.1; email notification
          was suppressed to prevent excessive email.

          For a list of all the issues modified, search jira comments for this
          (hopefully) unique string: 20080415hossman3

          Hoss Man added a comment -

          Patch committed as-is, plus some additional test cases and example schema usages.

          Yonik Seeley added a comment -

          I looked it over quickly; looks fine to me!

          A few weeks ago I did some more premature optimization, writing a circular queue (power-of-two based) that's about twice as fast as a LinkedList for our typical usage. I was intending it for use in BufferedTokenFilter if insertion/removal of tokens in the middle of the buffers was unneeded (or rare, as it could be implemented).

          Anyway, I'm attaching it here for lack of a better place. I support committing the current BufferedTokenFilter as-is, since I doubt the LinkedList implementation will be any kind of bottleneck.
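          For reference, a minimal sketch of the power-of-two trick (this is not the attached class, just an illustration): keeping the capacity a power of two lets the head/tail wrap-around be a bitwise mask instead of a modulo, which is most of the win over LinkedList for add/remove-at-the-ends usage.

          class PowerOfTwoQueue<T> {
            private Object[] buf;
            private int head;  // index of the first element
            private int size;  // number of elements currently stored

            PowerOfTwoQueue(int initialCapacity) {
              int cap = 1;
              while (cap < initialCapacity) cap <<= 1;  // round capacity up to a power of two
              buf = new Object[cap];
            }

            void addLast(T item) {
              if (size == buf.length) grow();
              buf[(head + size) & (buf.length - 1)] = item;  // mask instead of %
              size++;
            }

            @SuppressWarnings("unchecked")
            T removeFirst() {
              if (size == 0) throw new java.util.NoSuchElementException();
              T item = (T) buf[head];
              buf[head] = null;
              head = (head + 1) & (buf.length - 1);
              size--;
              return item;
            }

            private void grow() {
              Object[] bigger = new Object[buf.length << 1];
              for (int i = 0; i < size; i++) {
                bigger[i] = buf[(head + i) & (buf.length - 1)];
              }
              buf = bigger;
              head = 0;
            }
          }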

          Hoss Man added a comment -

          Back in this email...

          http://www.nabble.com/BufferingTokenStream-and-RemoveDups-p4320716.html

          ...Yonik posted an off-the-cuff solution to this problem that also included a nice reusable "BufferedTokenStream".

          I've cleaned it up, added some tests, and fixed the bugs the tests surfaced (mainly the infinite loops, and the fact that the dup detection ignored every token with a non-zero position gap that was followed by a 0 position gap).

          I'll commit this in the next few days unless anyone has any objections/comments regarding ways to improve it. The RemoveDuplicatesTokenFilter.process method is much less elegant than Yonik's original version, but that's the only way I could get it to work. I'd welcome any suggestions on regaining the elegance without breaking the tests.
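          To make the intended behavior concrete, here is a rough sketch of the same-position dedup rule (not the committed RemoveDuplicatesTokenFilter; Tok is a hypothetical stand-in with just the two fields the rule cares about):

          import java.util.HashSet;
          import java.util.Set;

          class Tok {
            final String term;
            final int posInc;  // position increment; 0 means stacked on the previous token
            Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
          }

          class SamePositionDedup {
            private final Set<String> termsAtCurrentPosition = new HashSet<String>();

            /** True if the token should be emitted, false if it duplicates one at the same position. */
            boolean keep(Tok t) {
              if (t.posInc > 0) {
                termsAtCurrentPosition.clear();  // a new position starts; earlier terms no longer count as dups
              }
              return termsAtCurrentPosition.add(t.term);  // add() is false if this term was already seen at this position
            }
          }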

          Richard "Trey" Hyde added a comment -

          YMMV, I had to put this together rather quickly this morning to dedupe tokens in different positions. But again, that's a result of my imperfect SOLR-14 patch. I just thought I'd get a little discussion started on the topic. I'd like to see some solution integrated.

          Re BufferedTokenFilter: that would be great. It's much easier to deal with context and semantics when you have forward and reverse state... especially for non-Java-literate people like myself.

          Yonik Seeley added a comment -

          Another issue is memory use for very large fields (the current impl doesn't stream).

          It seems like a lot of token filter implementations need some kind of buffering that could be refactored into a TokenQueue, or perhaps a base-class BufferedTokenFilter.
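          One possible shape for that refactoring, sketched here with made-up names (not Solr's eventual BufferedTokenStream): the base class owns the output buffer and the pull loop, while subclasses implement a process step that consumes input and queues whatever should be emitted.

          import java.util.Iterator;
          import java.util.LinkedList;

          abstract class BufferedFilterSketch<T> {
            private final LinkedList<T> output = new LinkedList<T>();
            private final Iterator<T> source;

            BufferedFilterSketch(Iterator<T> source) { this.source = source; }

            /** Examine one input token; may call read() to look further ahead and write() to emit. */
            protected abstract void process(T token);

            /** Queue a token for emission downstream. */
            protected final void write(T token) { output.add(token); }

            /** Pull another token from the input, or null if exhausted. */
            protected final T read() { return source.hasNext() ? source.next() : null; }

            /** Return the next emitted token, running process() until something is buffered. */
            public final T next() {
              while (output.isEmpty() && source.hasNext()) {
                process(source.next());
              }
              return output.isEmpty() ? null : output.removeFirst();
            }
          }

          A duplicate-removal filter would then just override process() and skip the write() call for tokens it considers duplicates.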

          Hoss Man added a comment -

          DeDuping purely on term text without any regard for position seems a little extreme for most cases ... it will seriously throw off term frequencies and phrase queries.

          I'm sure it has benefits, but I'd prefer a solution where that behavior is an "option" and the default is to only remove duplicate tokens at the same position.

          Richard "Trey" Hyde added a comment -

          Alternate implementation based on LinkedHashMap instead of a LinkedList queue. Should work better for larger data sets. Still doesn't pay attention to position but I don't know enough to know if that is important.
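          The gist of that approach (a sketch, not the attached patch): a LinkedHashMap keyed on term text gives O(1) duplicate checks and preserves first-seen order, though as noted it ignores positions entirely.

          import java.util.ArrayList;
          import java.util.LinkedHashMap;
          import java.util.List;

          class TermTextDedupSketch {
            /** Keeps the first occurrence of each term text, in original order. */
            static List<String> dedup(List<String> termTexts) {
              LinkedHashMap<String, Boolean> seen = new LinkedHashMap<String, Boolean>();
              for (String t : termTexts) {
                if (!seen.containsKey(t)) {
                  seen.put(t, Boolean.TRUE);
                }
              }
              return new ArrayList<String>(seen.keySet());
            }
          }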

          Richard "Trey" Hyde added a comment -

          Re algorithm: yes, it would not be advisable for large fields.
          I'm not all that familiar with the subtleties of position, but for at least some of my data I am getting duplicates in other positions.

          Yonik Seeley added a comment -

          It looks like this implementation is O(n^2), right?
          It also looks like it removes all duplicates for a whole field, not limiting itself to those tokens with a positionIncrement=0. Is this the intended behavior?

          Richard "Trey" Hyde added a comment -

          Sorry for no factory class. My code repo is way too divergent for a clean diff at the moment but the factory is fairly obvious.
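          For anyone wiring this up, the factory would presumably follow the usual Solr filter-factory pattern, roughly like the sketch below (the filter class name is hypothetical and depends on what the attached patch calls it):

          package org.apache.solr.analysis;

          import org.apache.lucene.analysis.TokenStream;

          public class DedupTokenFilterFactory extends BaseTokenFilterFactory {
            public TokenStream create(TokenStream input) {
              return new DedupTokenFilter(input);  // hypothetical name for the attached filter
            }
          }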

          Richard "Trey" Hyde added a comment -

          This one is working well for me. It also fixes a number of the problems that came up after my SOLR-14 patch.


            People

            • Assignee:
              Hoss Man
              Reporter:
              Hoss Man
            • Votes:
              1
              Watchers:
              1
