Solr / SOLR-2378

FST-based Lookup (suggestions) for prefix matches.

    Details

      Description

      Implement a subclass of Lookup based on finite state automata/ transducers (Lucene FST package). This issue is for implementing a relatively basic prefix matcher, we will handle infixes and other types of input matches gradually. Impl. phases:

      • write a DFA based suggester effectively identical to ternary tree based solution right now,
      • baseline benchmark against tern. tree (memory consumption, rebuilding speed, indexing speed; reuse Andrzej's benchmark code)
      • modify DFA to encode term weights directly in the automaton (optimize for onlyMostPopular case)
      • benchmark again
      • modify the tutorial on the wiki http://wiki.apache.org/solr/Suggester
      1. SOLR-2378.patch (55 kB, Dawid Weiss)

        Issue Links

          Activity

          Dawid Weiss added a comment -

          Hi guys. I probably need those admin rights on SOLR's JIRA project as well because I can't assign this to myself.

          Uwe Schindler added a comment -

          This time I cannot help - I don't have admin rights in SOLR! In my opinion, we should merge the user rights of both projects.

          Mark Miller added a comment -

          Hi guys. I probably need those admin rights on SOLR's JIRA project as well because I can't assign this to myself

          You should have a committer role now.

          Dawid Weiss added a comment -

          I didn't have time to take care of this until now, apologies. So, looking at Lookup#lookup(), I just wanted to clarify:

            /**
             * Look up a key and return possible completion for this key.
             * @param key lookup key. Depending on the implementation this may be
             * a prefix, misspelling, or even infix.
             * @param onlyMorePopular return only more popular results
             * @param num maximum number of results to return
             * @return a list of possible completions, with their relative weight (e.g. popularity)
             */
            public abstract List<LookupResult> lookup(String key, boolean onlyMorePopular, int num);
          

          the "onlyMorePopular" means more popular than... what? I see TSTLookup and JaspellLookup (Andrzej, will you confirm, please?) sort matches in a priority queue by their associated value (frequency, I guess). This makes sense, but onlyMorePopular is misleading – it should be called onlyMostPopular (those with native knowledge of English subtleties, speak up if I'm right here).
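          The priority-queue behavior described above can be sketched roughly as follows (a minimal illustration, not the TSTLookup/JaspellLookup code itself; the `Result` type and `topK` helper are hypothetical): a bounded min-heap keyed on weight, so only the `num` most popular completions survive.

```java
import java.util.PriorityQueue;

public class TopK {
    // Hypothetical result type: a completion key and its associated weight.
    public record Result(String key, float weight) {}

    /** Keep only the `num` highest-weight results using a bounded min-heap. */
    public static PriorityQueue<Result> topK(Iterable<Result> matches, int num) {
        PriorityQueue<Result> pq =
            new PriorityQueue<>((x, y) -> Float.compare(x.weight(), y.weight()));
        for (Result r : matches) {
            pq.offer(r);
            if (pq.size() > num) {
                pq.poll(); // evict the currently least popular result
            }
        }
        return pq;
    }
}
```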

          I also see and wanted to confirm – the Dictionary can come from various sources, so we can't rely on the presence of the built-in Lucene automaton, can we? Even if I wanted to reuse it, there'd be no easy way to determine if it's a full automaton or a partial one (because of the gaps/trimming)... I think I'll just implement the solution by building the automaton from whatever Dictionary comes in and serializing/deserializing it similarly to TSTLookup.

          Sounds ok?

          Andrzej Bialecki added a comment -

          I see TSTLookup and JaspellLookup (Andrzej, will you confirm, please?) sorts matches in a priority queue by their associated value (frequency I guess)

          Correct. I agree that the name is so-so, I inherited it from the spellchecker API - feel free to change it.

          the Dictionary can come from various sources, ...

          Yes. This is again a legacy of the Lucene SpellChecker API. I tried to extend this API in Solr without changing it in Lucene (see the *IteratorWrapper classes and TermFreqIterator) but ultimately it would be better to refactor this.

          Dawid Weiss added a comment -

          The original suggester issue.

          Dawid Weiss added a comment -

          Soliciting feedback on the following questions:

          • suggesters currently have float weights associated with terms; can these floats be bucketed and returned in approximation or do they need to be exact copies of the input? For automata, bucketed weights (to, let's say 5-10 different values) provide terrific speed/size improvements, so if this is not a rigid requirement, I'd use them.
          • is there any info on threading of Solr components? I am in particular looking for mutable object fields in the suggester (can a single suggester instance be accessed by multiple threads at the same time)?
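          The bucketing idea in the first question can be sketched like this (an illustrative helper, not code from the patch; the `WeightBuckets` class and its parameters are assumptions): raw float weights are quantized into a small fixed number of buckets, trading exact weights for a much smaller value alphabet in the automaton.

```java
public class WeightBuckets {
    private final float min, max;
    private final int buckets;

    public WeightBuckets(float min, float max, int buckets) {
        this.min = min;
        this.max = max;
        this.buckets = buckets;
    }

    /** Map a raw weight to an approximate bucket index in [0, buckets-1]. */
    public int bucketOf(float weight) {
        if (max == min) return 0;
        int b = (int) ((weight - min) / (max - min) * buckets);
        // Clamp so the maximum weight and out-of-range values stay in bounds.
        return Math.min(buckets - 1, Math.max(0, b));
    }
}
```

With 5-10 buckets, many distinct input weights collapse to the same stored value, which is exactly the size/speed trade-off the question asks about.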

          I've implemented preliminary FST-based lookup (without weights yet). Speed-wise it doesn't rock because data is converted to/from utf8 on input/output and sorted during construction, but it is still acceptable, even at this early stage I think:

          JaspellLookup        buildTime[ms]=112 lookupTime[ms]=288
          TSTLookup            buildTime[ms]=115 lookupTime[ms]=103
          FSTLookup            buildTime[ms]=464 lookupTime[ms]=145
          

          now... that was speed only, check out the in-memory size

          JaspellLookup        size[B]=81,078,997
          TSTLookup            size[B]=53,453,696
          FSTLookup            size[B]=2,909,396
          

          (This benchmark stores very limited vocabulary items – long numbers only, so it is skewed from reality, but it's nice to see something like this, huh?).

          Ryan McKinley added a comment -

          is there any info on threading of Solr components? I am in particular looking for mutable object fields in the suggester (can a single suggester instance be accessed by multiple threads at the same time)?

          Components are initialized at startup and the same instance is used for every request (multi-threaded).

          If you need to use the same objects in prepare and process, you can either put them in the request context map, or add something to ResponseBuilder (perhaps better if this is widely used).

          Dawid Weiss added a comment -

          Thanks Ryan. Well, I'll just allocate off the heap, hoping it'll fit in TLB.

          I will also need a reasonably large, realistic list of suggestions to test/benchmark. By realistic I mean something that folks actually use in production (terms & weights); a sample of a few hundred thousand terms would probably be good enough. I can come up with something of my own (based on Wikipedia or Carrot2 data), but a real data set would be better.

          Dawid Weiss added a comment -

          I'm done for the day. Lots of benchmarking and tweaking. Conclusions:

          • Use raw UTF-16 values of input strings, as they are ready to use from the input Strings and don't need to be converted. The automaton is slightly bigger, but not by much (read: most likely because English ASCII chars fall into the single-byte vint range anyway).
          • I created non-recursive and recursive versions of the main tail-state traversal loop, but the gain is minimal.
          • I have a relatively fast machine (i7, quad core), and parallel GC combined with TLABs makes local allocations nearly zero-cost. Locally reused structures speed things up by 5%, but increase code complexity greatly.
          • The benchmark currently in the codebase is very skewed. I tried on real life data (that I cannot share, unfortunately) and the results are:
            TSTLookup            size[B]=12,598,794
            FSTLookup            size[B]=472,679
            * Running 10 iterations for TSTLookup ...
              - warm-up 5 iterations...
              - main iterations:
            TSTLookup            buildTime[ms]=19 lookupTime[ms]=4,813
            * Running 10 iterations for FSTLookup ...
              - warm-up 5 iterations...
              - main iterations:
            FSTLookup            buildTime[ms]=68 lookupTime[ms]=263
            

            It looks as if I've made a mistake, but not really. With the actual data the fan-out rate is much higher than on long numbers. Also, real terms are shorter than longs and 75% of their prefix is usually a 2-3 letter combination resulting in lots of matches that need to be passed through the priority queue, etc. With the FST weights are part of the search structure (approximated weights, to be exact) and so the cost of building it is higher, but the gain at runtime is substantial. The lookup time should be virtually constant (with very small deviation).

          • JaspellLookup does not pass the tests if the input prefix has a space inside. TSTLookup and FSTLookup work just fine.
          • I compared against automata API in Morfologik; the speedup is about 30% (only ints are passed, no Arc object decompression, etc.). But then – the API of Lucene is more flexible.

          I'll clean up the code and create a final patch for testing/ evaluation tomorrow.

          Michael McCandless added a comment -

          Wow those improvements are awesome – FST 26.7X smaller RAM footprint, 18X faster lookups, but build time is 3.6X slower.

          This is built on a composite reader, right? Does the build time include the time to enum the terms from MultiTermsEnum?

          Dawid Weiss added a comment -

          The build time needs to sort the input again (and create it in the first place). Because Lookup API assumes suggestion keywords can come from a variety of sources there is no guarantee they will be sorted, so we need to sort them before we can build the automaton.
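          The sorting step described above can be sketched as follows (a hypothetical helper, not the patch's actual code): because the Lookup API makes no ordering guarantee about the incoming Dictionary terms, they are collected and sorted before the automaton builder, which requires ascending key order, ever sees them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class SortedInputCollector {
    /** Collect terms arriving in arbitrary order and return them sorted. */
    public static List<String> collectSorted(Iterator<String> terms) {
        List<String> all = new ArrayList<>();
        while (terms.hasNext()) {
            all.add(terms.next());
        }
        Collections.sort(all); // the automaton builder requires sorted input
        return all;
    }
}
```

This extra collect-and-sort pass is where the FST's higher build time comes from.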

          Still, I think the numbers are acceptable... if you need on-line construction of these suggestions you'll pick TST (it can add new keywords to the structure dynamically); for a batch-load suggester you'd pick the FST one.

          It is also very likely that I overlooked something that could bring those numbers down, I'll create a clean patch tomorrow, so everything will be out there for improving.

          Dawid Weiss added a comment -

          FST implementation of Lookup interface. Completely rewritten benchmark class utilizing real data.

          Dawid Weiss added a comment -

          This is a git patch (trim the first level to apply without git, should work out of the box) against the trunk that implements FST-based suggester. I've also added a more realistic benchmark that shows performance limitations of the current suggester implementations, especially in the "onlyMorePopular" mode. Some benchmarks from my machine:

          Time to create internal data structures out of 50,000 suggestion terms (692 kB of UTF-8):

          JaspellLookup   input: 50,000, time[ms]: 30 [+- 4.93]
          TSTLookup       input: 50,000, time[ms]: 39 [+- 2.56]
          FSTLookup       input: 50,000, time[ms]: 62 [+- 2.52]
          

          Memory used:

          JaspellLookup   size[B]:    7,868,863
          TSTLookup       size[B]:    7,914,144
          FSTLookup       size[B]:      300,148
          

          Traversal speed (randomized prefixes of input sequences). We start with full matches:

          -- prefixes: 100-200, num: 7, onlyMorePopular: true
          JaspellLookup   queries: 50000, time[ms]: 108 [+- 6.23], ~qps: 462
          TSTLookup       queries: 50000, time[ms]: 54 [+- 1.54], ~qps: 922
          FSTLookup       queries: 50000, time[ms]: 66 [+- 1.30], ~qps: 761
          

          These results are overly optimistic because users rarely type in something in full; let's cut the prefix length to 6-9 chars:

          -- prefixes: 6-9, num: 7, onlyMorePopular: true
          JaspellLookup   queries: 50000, time[ms]: 155 [+- 4.20], ~qps: 322
          TSTLookup       queries: 50000, time[ms]: 208 [+- 3.99], ~qps: 241
          FSTLookup       queries: 50000, time[ms]: 71 [+- 1.36], ~qps: 700
          

          Clearly, the FST is more resilient to the length of the input prefix... let's cut it to 2-4 characters:

          -- prefixes: 2-4, num: 7, onlyMorePopular: true
          JaspellLookup   queries: 50000, time[ms]: 420 [+- 5.37], ~qps: 119
          TSTLookup       queries: 50000, time[ms]: 1687 [+- 10.96], ~qps: 30
          FSTLookup       queries: 50000, time[ms]: 90 [+- 2.27], ~qps: 554
          

          I didn't have the time to look into it, but TST's result is surprisingly bad with such short prefixes. FST keeps nearly the same perf. level.

          In the current implementation I throw an exception if somebody tries to get suggestions from FST without sorting by weights (this is equivalent to building the suggester with a single weight for all terms). This could be implemented, but at a small performance penalty – to be considered if you think it is useful. The above performance problems for short prefixes are interestingly present even with onlyMorePopular set to false:

          -- prefixes: 100-200, num: 7, onlyMorePopular: false
          JaspellLookup   queries: 50000, time[ms]: 94 [+- 6.45], ~qps: 532
          TSTLookup       queries: 50000, time[ms]: 46 [+- 0.98], ~qps: 1092
          -- prefixes: 6-9, num: 7, onlyMorePopular: false
          JaspellLookup   queries: 50000, time[ms]: 123 [+- 3.12], ~qps: 405
          TSTLookup       queries: 50000, time[ms]: 188 [+- 3.03], ~qps: 266
          -- prefixes: 2-4, num: 7, onlyMorePopular: false
          JaspellLookup   queries: 50000, time[ms]: 225 [+- 5.69], ~qps: 222
          TSTLookup       queries: 50000, time[ms]: 1523 [+- 16.69], ~qps: 33
          

          Peek at the code and let me know if you can think of any improvements/ modifications, especially wrt Solr infrastructure (I specifically didn't implement any real solr instance test case).

          Dawid Weiss added a comment -

          Updated empty input/ failing test case.

          Dawid Weiss added a comment -

          I've been waiting for somebody to look at this patch, guys, just to confirm that everything is fine with it. If so, I'd like to commit it in and move on to infix suggestions support maybe.

          Robert Muir added a comment -

          Took a quick look:

          Builder.add(char[], int, int, ..) adds codepoints (Character.codePointAt/Character.charCount) [utf-32 order] but the comparator you use when building the automaton compares characters [utf-16 order]. so if someone has a term in the supplementary range in their index, the order will be inconsistent.

          So I think the comparator should just compare codepoints (it should iterate with codePointAt/charCount too)?
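          The suggested fix can be sketched as a code-point comparator (an illustration of the idea, not the committed code): iterate with codePointAt/charCount on both strings so supplementary characters sort in UTF-32 order, matching the order Builder.add() uses.

```java
import java.util.Comparator;

public class CodePointComparator implements Comparator<String> {
    @Override
    public int compare(String a, String b) {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            // Read full code points, so surrogate pairs compare as one unit.
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) return ca - cb;
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // Shorter string (fewer remaining units) sorts first.
        return (a.length() - i) - (b.length() - j);
    }
}
```

The divergence only shows up for supplementary characters: U+10000 (a surrogate pair starting with \uD800) sorts before U+FFFF in UTF-16 char order, but after it in code-point order.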

          Yonik Seeley added a comment -

          If it causes too much of a lookup performance hit, the Builder could just build in utf-16 order too?

          Robert Muir added a comment -

          I am referring to build-time, not runtime here.

          run-time can handle supplementary characters wrong and I wouldn't object to committing it,
          but currently if someone has terms > 0xFFFF in their index it will prevent the FST from
          being built at all and suggesting will not work? (as I think the FST will throw an exception?)

          Dawid Weiss added a comment -

          Oh, right – I didn't peek at the inside of Builder.add(char[],...), but I will verify this by trying to add something that has multilingual stuff in it – will update the patch tomorrow, hopefully. I would also love to have somebody who actually uses suggestions to try to compile it and use it on a production data set to see if my benchmark was approximately right with respect to the speed differences between the different available implementations.

          Dawid Weiss added a comment -

          Well spotted, Robert – indeed, three-byte codepoints were throwing automaton exceptions. I've added a test for this. I also added "exact match promotion" to the top of the suggestions list, regardless of the score of the exact match. This is controlled by a final flag at the moment... maybe it should become a parameter, I don't know.
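          The "exact match promotion" described above can be illustrated like this (a sketch over a plain list of suggestion keys; the class and method names are hypothetical): if the typed key itself appears among the returned suggestions, it is moved to the front regardless of its weight.

```java
import java.util.List;

public class ExactMatchPromotion {
    /** Move the exact match for `key`, if present, to the front of the list. */
    public static void promoteExact(List<String> suggestions, String key) {
        int idx = suggestions.indexOf(key);
        if (idx > 0) { // absent (-1) or already first (0): nothing to do
            suggestions.remove(idx);
            suggestions.add(0, key);
        }
    }
}
```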

          Dawid Weiss added a comment -

          Updated patch:

          • fixed a bug with unicode codepoints [rmuir]
          • added promotion of exact matches to the top of the list (weighted mode)
          • added a "poor man's" alphabetical lookup even if weights are used (still faster than the existing implementations, but suboptimal to the case when no weights are used at all).
          • Added a few test cases here and there, reorganized the code.
          • Turned thresholds into real config parameters.
          Dawid Weiss added a comment -

          Ok, updated patch. The only thing I would like to add is a real Solr handler test, much like SuggesterTest. Should I add a separate test class or simply add another handler to that config file and test methods to SuggesterTest?

          Also, this one puzzled me:

              threshold = config.get(THRESHOLD_TOKEN_FREQUENCY) == null ? 0.0f
                      : (Float)config.get(THRESHOLD_TOKEN_FREQUENCY);
          

          What are the conversion rules for NamedList that SolrSpellChecker gets in init()? I have an Integer parameter, but didn't check what is returned for, say, "12" (String, Float, Integer?).
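          Until the NamedList conversion rules are pinned down, one defensive option is to coerce whatever object comes back instead of blind-casting to Float (a hypothetical helper sketch, not Solr code): handle Number subtypes and String uniformly.

```java
public class ConfigUtil {
    /** Coerce a config value that may arrive as Number or String to float. */
    public static float toFloat(Object value, float defaultValue) {
        if (value == null) return defaultValue;
        if (value instanceof Number n) return n.floatValue();
        return Float.parseFloat(value.toString());
    }
}
```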

          Dawid Weiss added a comment -

          Adding Solr tests, removed the big queries file so that it doesn't bloat the patch (will commit it in directly).

          Dawid Weiss added a comment -

          In trunk.

          Robert Muir added a comment -

          Just an idea: should we default to this implementation in trunk?
          It seems to be a significant reduction in RAM.

          Dawid Weiss added a comment -

          We could... after all when 4.0 sees the light of day backwards compatibility does not need to be (strictly) enforced, right? I would love somebody who actually uses this piece of functionality to test it in a realistic setting first though. If nothing else, this would at least tell us if any performance gain is visible at all (I'm guessing the HTTP stack will be the bottleneck before we can see any differences between all the implemented methods). The big drop in RAM usage should be visible right away, so this is a bonus.

          Robert Muir added a comment -

          Note: this is now backported to branch_3x.

          Robert Muir added a comment -

          Bulk close for 3.3


            People

            • Assignee:
              Dawid Weiss
              Reporter:
              Dawid Weiss
            • Votes: 0
              Watchers: 0
