Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 3.1
    • Component/s: search
    • Labels:
      None

      Description

      Autosuggest is a common search function that can be integrated
      into Solr as a SearchComponent. Our first implementation will
      use the TernaryTree found in Lucene contrib.

      • Enable creation of the dictionary from the index or via Solr's
        RPC mechanism
      • What types of parameters and settings are desirable?
      • Hopefully in the future we can include user click through
        rates to boost those terms/phrases higher
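      As a sketch of the core idea only (illustrative, not the Lucene
      contrib TernaryTree or any patch attached here), a minimal ternary
      search tree with prefix lookup might look like:

```java
import java.util.*;

// Minimal ternary search tree (TST) sketch: insert terms, then walk the
// subtree under a prefix to collect completions. Illustrative only.
public class TstSuggest {
    static class Node {
        char c; boolean end; Node lo, eq, hi;
        Node(char c) { this.c = c; }
    }
    Node root;

    void insert(String s) { root = insert(root, s, 0); }
    private Node insert(Node n, String s, int i) {
        char c = s.charAt(i);
        if (n == null) n = new Node(c);
        if (c < n.c) n.lo = insert(n.lo, s, i);
        else if (c > n.c) n.hi = insert(n.hi, s, i);
        else if (i < s.length() - 1) n.eq = insert(n.eq, s, i + 1);
        else n.end = true;
        return n;
    }

    List<String> suggest(String prefix) {
        List<String> out = new ArrayList<>();
        Node n = find(root, prefix, 0);
        if (n == null) return out;
        if (n.end) out.add(prefix);              // the prefix itself is a term
        collect(n.eq, new StringBuilder(prefix), out);
        return out;
    }
    private Node find(Node n, String s, int i) {
        if (n == null) return null;
        char c = s.charAt(i);
        if (c < n.c) return find(n.lo, s, i);
        if (c > n.c) return find(n.hi, s, i);
        if (i < s.length() - 1) return find(n.eq, s, i + 1);
        return n;
    }
    private void collect(Node n, StringBuilder prefix, List<String> out) {
        if (n == null) return;
        collect(n.lo, prefix, out);              // smaller characters first
        prefix.append(n.c);
        if (n.end) out.add(prefix.toString());
        collect(n.eq, prefix, out);
        prefix.deleteCharAt(prefix.length() - 1);
        collect(n.hi, prefix, out);              // larger characters last
    }

    public static void main(String[] args) {
        TstSuggest t = new TstSuggest();
        for (String w : new String[] {"test", "tea", "team", "ten", "toy"}) t.insert(w);
        System.out.println(t.suggest("te"));
    }
}
```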
      1. SOLR-1316_3x-2.patch
        76 kB
        Andrzej Bialecki
      2. SOLR-1316_3x.patch
        75 kB
        Grant Ingersoll
      3. SOLR-1316.patch
        9 kB
        Yonik Seeley
      4. SOLR-1316.patch
        79 kB
        Andrzej Bialecki
      5. SOLR-1316.patch
        74 kB
        Andrzej Bialecki
      6. SOLR-1316.patch
        75 kB
        Andrzej Bialecki
      7. SOLR-1316.patch
        110 kB
        Andrzej Bialecki
      8. SOLR-1316.patch
        72 kB
        Ankul Garg
      9. SOLR-1316.patch
        72 kB
        Ankul Garg
      10. SOLR-1316.patch
        72 kB
        Andrzej Bialecki
      11. SOLR-1316.patch
        73 kB
        Andrzej Bialecki
      12. SOLR-1316.patch
        69 kB
        Andrzej Bialecki
      13. SOLR-1316.patch
        69 kB
        Andrzej Bialecki
      14. suggest.patch
        52 kB
        Andrzej Bialecki
      15. suggest.patch
        85 kB
        Andrzej Bialecki
      16. TST.zip
        6 kB
        Ankul Garg
      17. suggest.patch
        80 kB
        Andrzej Bialecki

        Issue Links

          Activity

          Jason Rutherglen added a comment -

          An alternative to the TernaryTree which does not offer a
          traverse method is a Patricia Trie which conveniently has been
          Apache licensed and implemented at:
          http://code.google.com/p/patricia-trie/

          Further links about Patricia Tries:

          http://en.wikipedia.org/wiki/Radix_tree
          http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/PATRICIA
          http://www.imperialviolet.org/binary/critbit.pdf
          Jason Rutherglen added a comment -

          Patch coming...

          Andrzej Bialecki added a comment -

          Jason, did you make any progress on this? I'm interested in this functionality. I'm not sure tries are the best choice; unless heavily pruned they occupy a lot of RAM. I had some moderate success using an ngram-based method (I reused the spellchecker, with slight modifications) - the method is fast and reuses the existing spellchecker index, but the precision of lookups is not ideal.

          Jason Rutherglen added a comment -

          Andrzej,

          There's the ternary tree, which is supposed to be better? There are other algorithms for compressed dictionaries that could be used (offhand I can't think of them).

          Jason Rutherglen added a comment -

          Basically we need an algorithm that does suffix compression as well?

          Shalin Shekhar Mangar added a comment -

          This is a duplicate of SOLR-706.

          Since we have comments on both, which one should I close?

          Ankul Garg added a comment -

          Hi Shalin sir,
          My team and I have benchmarked the Ternary Tree and Trie for autocomplete, and the Ternary Tree gives the best insertion and search times. We have started working on a patch that can be integrated with Solr as a search component. The patch is expected soon.

          Jason Rutherglen added a comment -

          Ankul, sounds good, feel free to post your ternary tree implementation.

          There's some other algorithms to think about:
          "Incremental Construction of Minimal Acyclic Finite-State Automata"
          http://arxiv.org/PS_cache/cs/pdf/0007/0007009v1.pdf

          "Directed acyclic word graph"
          http://en.wikipedia.org/wiki/Directed_acyclic_word_graph

          "world's fastest scrabble program"
          http://www1.cs.columbia.edu/~kathy/cs4701/documents/aj.pdf

          These enable suffix compression and create much smaller word graphs.

          Jason Rutherglen added a comment -

          And a video: Lecture 25
          http://see.stanford.edu/see/lecturelist.aspx?coll=11f4f422-5670-4b4c-889c-008262e09e4e

          Ankul Garg added a comment -

          Hi Jason,
          My TST implementation is here. The zip contains 4 benchmarking results too : TST1.txt , TST2.txt etc.

          The 4 datasets were as follows; all words are real-life words extracted from a DBpedia dump, and each dataset consists of single words and two- and three-word phrases:
          1. The first dataset contains 100,000 tokens.
          2. The second dataset contains 500,000 tokens.
          3. The third dataset contains 1,000,000 tokens.
          4. The fourth dataset contains 5,000,000 tokens.

          These were the environment details while benchmarking:
          Platform: Linux
          java version "1.6.0_16"
          Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
          Java HotSpot(TM) 64-Bit Server VM (build 14.2-b01, mixed mode)
          RAM: 16 GiB
          Java heap size: default

          Is there any other way to balance the tree? Also, what's your progress?

          Grant Ingersoll added a comment -

          Note, not sure how it compares, but Lucene has a TST implementation in it already.

          Shalin Shekhar Mangar added a comment -

          Note, not sure how it compares, but Lucene has a TST implementation in it already.

          Yes, at org.apache.lucene.analysis.compound.hyphenation.TernaryTree, but it uses the char type as a pointer, thus limiting it to around 65K nodes. That will not be enough.
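          To illustrate the limit mentioned above (the method name below is
          made up for this sketch): char is an unsigned 16-bit type in Java,
          so a char used as a node "pointer" can address at most 2^16 nodes.

```java
// char is an unsigned 16-bit type, so a char "pointer" can address at most
// 2^16 = 65,536 distinct nodes -- the ~65K limit noted in the comment.
public class CharPointerLimit {
    static int maxAddressableNodes() {
        return Character.MAX_VALUE + 1; // 65,535 + 1 = 65,536 char values
    }
    public static void main(String[] args) {
        System.out.println(maxAddressableNodes());
    }
}
```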

          Ankul Garg added a comment -

          Also, Lucene's TST implementation doesn't have any method for
          autocompletion. I had problems understanding TST and Lucene's TST
          implementation, so I felt it was better to code it myself.

          Andrzej Bialecki added a comment -

          These enable suffix compression and create much smaller word graphs.

          DAWGs are problematic, because they are essentially immutable once created (the cost of insert / delete is very high). So I propose to stick to TSTs for now.

          Also, I think that populating the TST from the index would have to be discriminative, perhaps based on a threshold (so that it only adds terms with a large enough docFreq), and it would be good to adjust the content of the tree based on actual queries that return some results (poor man's auto-learning), gradually removing the least frequent strings to save space. We could also use as a source a field with 1-3 word shingles (no tf, unstored, to save space in the source index, with a similar thresholding mechanism).

          Ankul, what's the behavior of your implementation when dynamically adding/removing keys? Does the tree remain balanced?

          I also found an MIT-licensed implementation of a radix tree at http://code.google.com/p/radixtree, which looks good too, one spelling mistake in the API notwithstanding.

          Ankul Garg added a comment -

          Removing keys shall not affect the balance of the tree, as removal
          can be done simply by setting the end boolean at the leaf to false.
          Adding keys dynamically won't keep the tree balanced in my
          implementation, because the tree is balanced by ordered insertion of
          keys. So when adding more keys, the TST will have to be rebuilt to
          keep it balanced. Will that be problematic?
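          The "ordered insertion" idea referred to here can be sketched as
          follows (a hedged illustration, not code from any patch on this
          issue): inserting the middle element of each sorted range first,
          as in a binary search, yields a balanced tree.

```java
import java.util.*;

// Sketch of balanced insertion order: given a sorted list, emit the midpoint
// of each range before its halves. Inserting keys into a TST (or BST) in
// this order keeps the tree balanced. Names here are illustrative.
public class BalancedInsert {
    static <T> List<T> balancedOrder(List<T> sorted) {
        List<T> out = new ArrayList<>();
        insertMiddleFirst(sorted, 0, sorted.size() - 1, out);
        return out;
    }
    static <T> void insertMiddleFirst(List<T> s, int lo, int hi, List<T> out) {
        if (lo > hi) return;
        int mid = (lo + hi) >>> 1;
        out.add(s.get(mid));                     // insert the midpoint first
        insertMiddleFirst(s, lo, mid - 1, out);  // then each half, recursively
        insertMiddleFirst(s, mid + 1, hi, out);
    }
    public static void main(String[] args) {
        System.out.println(balancedOrder(Arrays.asList("a","b","c","d","e","f","g")));
    }
}
```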

          Shalin Shekhar Mangar added a comment -

          DAWGs are problematic, because they are essentially immutable once created (the cost of insert / delete is very high)

          Andrzej, why would immutability be a problem? Wouldn't we have to re-build the TST if the source index changes?

          Also, I think that populating TST from the index would have to be discriminative, perhaps based on a threshold

          I think the building of the data structure can be done in a way similar to what SpellCheckComponent does. We can re-use the HighFrequencyDictionary which can give tokens above a certain threshold frequency. The field names to use for building the data structure and the analysis can also be done like SCC. The response format for this component can also be similar to SCC.
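          The thresholding idea can be sketched like this (a hedged
          illustration; the method and names below are made up and are not the
          actual HighFrequencyDictionary API): keep only terms whose docFreq
          is at least a given fraction of the total document count.

```java
import java.util.*;

// Hedged sketch of frequency thresholding: keep only terms whose document
// frequency is at least `threshold` (a fraction of numDocs). Illustrative
// only -- not the Lucene/Solr HighFrequencyDictionary implementation.
public class HighFreqSketch {
    static List<String> highFreqTerms(Map<String, Integer> docFreq,
                                      int numDocs, float threshold) {
        int minFreq = (int) (threshold * numDocs);  // absolute cutoff
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            if (e.getValue() >= minFreq) kept.add(e.getKey());
        }
        Collections.sort(kept);  // sorted output suits ordered TST insertion
        return kept;
    }
    public static void main(String[] args) {
        Map<String, Integer> df = new HashMap<>();
        df.put("solr", 40); df.put("suggest", 10); df.put("rare", 1);
        System.out.println(highFreqTerms(df, 100, 0.05f));  // drops "rare"
    }
}
```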

          Andrzej Bialecki added a comment -

          Andrzej, why would immutability be a problem? Wouldn't we have to re-build the TST if the source index changes?

          Well, the use case I have in mind is a TST that improves itself over time based on the observed query log. That is, you would bootstrap a TST from the index (and here indeed you can do this on every searcher refresh), but it's often claimed that real query logs provide a far better source of autocomplete suggestions than the index terms. My idea was to start with what you have - in the absence of query logs - and then improve upon it by adding successful queries (and removing the least-used terms to keep the tree at a more or less constant size).

          Alternatively we could provide an option to bootstrap it from a real query log data.

          This use case requires mutability, hence my negative opinion about DAWGs (besides, we lack an implementation, whereas we already have a few suitable TST implementations). Perhaps this doesn't have to be an either/or, if we come up with a pluggable interface for this type of component?

          I think the building of the data structure can be done in a way similar to what SpellCheckComponent does. [..]

          +1

          Andrzej Bialecki added a comment -

          I started working on a skeleton component for this, so that we can test various ideas and implementations. Patch is coming shortly.

          Jason Rutherglen added a comment -

          Andrzej,

          Is it necessary to create a new abstraction layer? It looks like the SolrSpellChecker abstraction will work?

          Andrzej Bialecki added a comment -

          Yes, it should work for now. In fact I started writing a new component, but it had to replicate most of the spellchecker, so I will just add bits to the existing spellchecker. I'm worried, though, that we are abusing the semantics of the API, and that it will become more difficult to fit both functions into a single API as the functionality evolves.

          Jason Rutherglen added a comment -

          The DAWG seems like a potential fit as a replacement for the
          Lucene term dictionary. It would provide the extra benefit of
          faster prefix (and other) lookups. I believe it could be stored on
          disk by writing file pointers to the locations of the letters. I
          found the Stanford lecture on them interesting, though the
          papers seem to overcomplicate them. I could not find an existing
          Java implementation.

          As a generic library I think it could be useful for a variety of
          Lucene based use cases (i.e. storing terms in a compact form
          that allows fast lookups, prefix and otherwise).

          Andrzej Bialecki added a comment -

          This is very much a work in progress, posted to get review before proceeding.

          Highlights of this patch:

          • created a set of interfaces in o.a.s.spelling.suggest to hide implementation details of various autocomplete mechanisms.
          • imported sources of RadixTree, Jaspell TST and Ankul's TST. Wrapped each implementation so that it works with the same interface. (Ankul: I couldn't figure out how to actually retrieve suggested keys from your TST?)
          • extended HighFrequencyDictionary to return TermFreqIterator, which gives not only words but also their frequencies. Implemented a similar iterator for file-based term-freq lists.
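          A hedged sketch of the term-plus-frequency iterator idea described
          in the last bullet (the interface and method names below are
          illustrative, not the patch's actual code):

```java
import java.util.*;

// Sketch of a TermFreqIterator-style abstraction: an iterator that yields
// terms together with their frequencies. Illustrative names only.
public class TermFreqDemo {
    interface TermFreqIterator extends Iterator<String> {
        float freq();  // frequency of the most recently returned term
    }

    // A trivial in-memory implementation over a sorted term -> freq map.
    static TermFreqIterator fromMap(final SortedMap<String, Float> m) {
        final Iterator<Map.Entry<String, Float>> it = m.entrySet().iterator();
        return new TermFreqIterator() {
            Float cur;
            public boolean hasNext() { return it.hasNext(); }
            public String next() {
                Map.Entry<String, Float> e = it.next();
                cur = e.getValue();
                return e.getKey();
            }
            public float freq() { return cur; }
            public void remove() { throw new UnsupportedOperationException(); }
        };
    }

    public static void main(String[] args) {
        SortedMap<String, Float> m = new TreeMap<>();
        m.put("solr", 10f); m.put("suggest", 5f);
        TermFreqIterator tfi = fromMap(m);
        while (tfi.hasNext()) System.out.println(tfi.next() + " " + tfi.freq());
    }
}
```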
          Ankul Garg added a comment -

          For the purpose of benchmarking alone, I employed DFS just to find
          the number of hits for each autocompletion. But to retrieve complete
          keys, just create a String key variable in the TST node. In the
          insert function, where the end boolean is set to true (the
          termination condition), the key can be assigned the complete string.
          I will modify the code and post it soon. Maybe you can also make the
          changes described above.


          Ankul Garg added a comment -

          Modified the code to return a list of suggested keys. Andrzej, kindly update the same in your patch.

          Ishan Chattopadhyaya added a comment -

          For an extremely fast in-memory lookup table, I saw a TrieMap used in one of my projects. In a TrieMap, the nodes of a hash map are internally arranged like a trie. The following implementation is very space-efficient:
          http://airhead-research.googlecode.com/svn/trunk/sspace/src/edu/ucla/sspace/util/

          Also, I have a fast memory-mapped, file-based on-disk TST implementation. If someone thinks it would be useful, I can submit a patch.

          Jason Rutherglen added a comment -

          Ishan,

          Feel free to post your disk TST implementation, it sounds interesting.

          Andrzej Bialecki added a comment -

          Updated patch that includes the new TST sources. Tests on a 100k-word dictionary yield very similar results for the TST and Jaspell implementations, i.e. the initial build time is around 600 ms, and the lookup time is around 4-7 ms for prefixes that yield more than 100 results.

          To test it, put this in your solrconfig.xml:

            <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
              <lst name="spellchecker">
                <str name="name">suggest</str>
                <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
                <str name="lookupImpl">org.apache.solr.spelling.suggest.jaspell.JaspellLookup</str>
                <str name="field">text</str>
                <str name="sourceLocation">american-english</str>
              </lst>
            </searchComponent>
          
          ...

          And then use e.g. the following parameters:

          spellcheck=true&spellcheck.build=true&spellcheck.dictionary=suggest& \
          spellcheck.extendedResults=true&spellcheck.count=100&q=test
          
          Andrzej Bialecki added a comment -

          Forgot to add - the RadixTree implementation doesn't work for now - it needs further refactoring to return the completed keys, and not just the values stored in nodes ...

          Ankul Garg added a comment -

          Nice work, Andrzej! There's a little problem with the insertion of tokens in the build function of the TSTLookup class. Strings in HighFrequencyDictionary must be in sorted order, and simply iterating over the dictionary and adding strings in sorted order to the TST will make the tree highly unbalanced. An ordered insertion of strings, in the same way one does a binary search over a sorted list, will make the tree balanced. The balancedTree function of the TSTAutocomplete class takes care of that.

          Ankul Garg added a comment -

          Andrzej, how are you creating the new patch? The Solr svn server seems to be down. Please tell me ASAP - I've got to update the patch.

          Andrzej Bialecki added a comment -

          Re: the tree creation - well, this is a current limitation of the Dictionary API, which provides only an Iterator. So in the general case it's not possible to start from the middle of the iterator so that the tree is well balanced. Is it possible to re-balance the tree on the fly?

          Re: svn - it works for me ...

          Ankul Garg added a comment -

          Re: I think we can first add the terms and frequencies to separate ArrayLists using the iterator, and then start from the middle?

          Andrzej Bialecki added a comment -

          Well, this is kind of ugly, because it increases the memory footprint of the build phase. That was the whole point of using an Iterator in the Dictionary: you don't have to cache all dictionary data in memory - dictionaries can be large, and they are not guaranteed to be sorted or to have unique keys.

          But if there are no better options for now, then yes, we could do this just in TSTLookup. Is there really no way to rebalance the tree?

          Ankul Garg added a comment -

          I couldn't find any way to balance the tree dynamically at each insertion, but I am trying to figure out a possible way (maybe the way binary trees are balanced, by dynamically modifying the root of the tree). Until then, we can balance it by adding terms to a List and then inserting as mentioned above. Or, in case the Dictionary is not sorted and is randomly ordered, a random insertion of strings will also give a roughly balanced tree. We can benchmark it both ways. What do you say?

          Mike Anderson added a comment -

          Two questions, and apologies if they are addressed in the patches themselves:

          1) Will this be supported on distributed setups?

          2) (Maybe this is a new ticket.) Is it possible to store the field name in the spellcheck index? The use case for this is: I create a dictionary from the 'person' field and the 'title' field; when using autosuggest it would be nice to separate suggestions into 'person' suggestions and 'title' suggestions.

          Ankul Garg added a comment -

          Re: Mike
          I am answering your 2nd query. Yes, it is possible to auto-suggest separately for separate fields. Create separate NamedList configurations in the solrconfig.xml file, specifying the field name(s) for each configuration. Words from the Solr index will then be extracted only from the specified field name(s) for each configuration, and a separate search tree will be created for each configuration.

          Andrzej Bialecki added a comment -

          Re: question 1 - currently this component doesn't support populating the dictionary from a distributed index.

          Shalin Shekhar Mangar added a comment -

          I've started looking into the patch.

          1. Why do we concatenate all the tokens into one before calling Lookup#lookup? It seems we should be getting suggestions for each token just as SpellCheckComponent does.
          2. Related to #1, the Lookup#lookup method should return something more fine-grained than a SpellingResult
          3. Has anyone done any benchmarking to figure out the data structure we want to go ahead with?

          I love that we are (ab)using the SpellCheckComponent. The good part is that if we go this route, this auto-suggest pseudo-component will automatically work with distributed setups.

          Andrzej Bialecki added a comment -

          Thanks for the review!

          Why do we concatenate all the tokens into one before calling Lookup#lookup? It seems we should be getting suggestions for each token just as SpellCheckComponent does.

          Yeah, it's disputable, and we could change it to use single tokens ... My thinking was that the usual scenario is that you submit autosuggest queries soon after the user starts typing, and the highest perceived value of such functionality is when it can suggest complete meaningful phrases and not just individual terms. I.e. when you start typing "token sug" it won't suggest "token sugar" but will instead suggest "token suggestions".

          Related to #1, the Lookup#lookup method should return something more fine grained rather than a SpellingResult

          Such as? What you put in is what you get, so the fact that we are getting complete phrases as suggestions is the consequence of the choice above - the trie in this case is populated with phrases. If we populate it with tokens, then we can return per-token suggestions, again losing the added value I mentioned above.

          Has anyone done any benchmarking to figure out the data structure we want to go ahead with?

          For now I'm sure that we do NOT want to use the impl. of RadixTree in this patch, because it doesn't support our use case - I'll prepare a patch that removes this impl. Other implementations seem comparable with respect to speed, based on casual tests using /usr/share/dict/words, but I haven't run any exact benchmarks yet.

          Ankul Garg added a comment -

          Shouldn't we be creating a separate AutoSuggestComponent like the SpellCheckComponent, having its own prepare, process and inform functions?

          Shalin Shekhar Mangar added a comment -

          My thinking was that the usual scenario is that you submit autosuggest queries soon after user starts typing the query, and the highest perceived value of such functionality is when it can suggest complete meaningful phrases and not just individual terms. I.e. when you start typing "token sug" it won't suggest "token sugar" but instead it will suggest "token suggestions".

          Yes but the decision of selecting the complete phrase or an individual term should be up to the user. This is controlled by the "queryAnalyzerFieldType" in SpellCheckComponent. We will index tokens returned by that analyzer so the user can configure whichever behavior he wants. For example, if it is KeywordAnalyzer, we will index/suggest phrases and if it is a WhitespaceAnalyzer we will index/suggest individual terms.

          Such as? What you put there is what you get so the fact that we are getting complete phrases as suggestions is the consequence of the choice above - the trie in this case is populated with phrases. If we populate it with tokens, then we can return per-token suggestions, again - losing the added value I mentioned above.

          My point was that SpellingResult is too coarse: it is a complete result (for all tokens given by "queryAnalyzerFieldType"). If that analyzer gives us multiple tokens then we must get suggestions for each, and in that case returning a SpellingResult for each token is not right. Instead the Suggester should combine suggestions for all tokens into a SpellingResult object. I don't have a suggestion on an alternative. It looks like we may need to invent a custom type which represents the (suggestion, frequency) pair.
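          A minimal sketch of such a (suggestion, frequency) pair type; the class name and the ordering policy below are assumptions for illustration, not the API that ended up in the patch.

```java
// Hypothetical sketch of a (suggestion, frequency) pair type; the class name
// and ordering policy are illustrative assumptions, not committed API.
public class SuggestionFreqPair implements Comparable<SuggestionFreqPair> {
    public final String suggestion;
    public final int freq;

    public SuggestionFreqPair(String suggestion, int freq) {
        this.suggestion = suggestion;
        this.freq = freq;
    }

    // Higher-frequency suggestions sort first; ties are broken alphabetically.
    @Override
    public int compareTo(SuggestionFreqPair other) {
        int byFreq = Integer.compare(other.freq, this.freq);
        return byFreq != 0 ? byFreq : this.suggestion.compareTo(other.suggestion);
    }

    @Override
    public String toString() {
        return suggestion + "(" + freq + ")";
    }

    public static void main(String[] args) {
        SuggestionFreqPair a = new SuggestionFreqPair("token suggestions", 42);
        SuggestionFreqPair b = new SuggestionFreqPair("token sugar", 7);
        System.out.println(a.compareTo(b) < 0); // prints true: 42 > 7, so a sorts first
    }
}
```

          Sorting a list of these pairs would then put the most frequent suggestions first, which is what onlyMorePopular-style ranking needs.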

          For now I'm sure that we do NOT want to use the impl. of RadixTree in this patch, because it doesn't support our use case - I'll prepare a patch that removes this impl. Other implementations seem comparable wrt. to the speed, based on casual tests using /usr/share/dict/words, but I didn't run any exact benchmarks yet.

          OK. Go ahead with the patch and I'll try to find some time to compare the two methods. What about DAWGs? Are we still considering them?

          Shouldn't we be creating a separate AutoSuggestComponent like the SpellCheckComponent, having its own prepare, process and inform functions?

          We could do that but, as Andrzej noted, we'd end up re-implementing a lot of its functionality. I'm not sure it is worth it. I agree that it'd be odd using parameters prefixed with "spellcheck" for auto-suggest, and it'd have been easier if it were vice-versa. Does anybody have a suggestion?

          Andrzej Bialecki added a comment -

          Updated patch:

          • removed the broken RadixTree,
          • changed Suggester and Lookup API so that they don't join the tokens - instead they will use whatever tokens are produced by the analyzer. For now results are merged into a single SpellingResult.
          Andrzej Bialecki added a comment -

          What about DAWGs? Are we still considering them?

          I would be happy to include DAWGs if someone were to implement them ...

          Brad Giaccio added a comment -

          We could do that but, as Andrzej noted, we'd end up re-implementing a lot of its functionality. I'm not sure it is worth it. I agree that it'd be odd using parameters prefixed with "spellcheck" for auto-suggest, and it'd have been easier if it were vice-versa. Does anybody have a suggestion?

          Couldn't you just extend SpellCheckComponent and make use of something like COMPONENT_NAME or PARAM_PREFIX in the param calls instead of the static string in SpellingParams? That way the AutoSuggestComponent would have COMPONENT_NAME=autoSuggest and the spelling one would have it set to spelling, and all the common code would live in the base class.

          Just a thought.
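          That idea can be sketched roughly as below; the class and method names are hypothetical illustrations, not code from the patch or from Solr itself.

```java
public class ParamPrefixDemo {
    // Hypothetical base class: the request-parameter prefix is pushed into a
    // method so spellcheck and autosuggest components can share param handling.
    abstract static class PrefixedParamComponent {
        /** e.g. "spellcheck" for spelling, "autoSuggest" for suggestions. */
        protected abstract String paramPrefix();

        /** Builds a fully qualified request-parameter name. */
        protected String param(String name) {
            return paramPrefix() + "." + name;
        }
    }

    static class AutoSuggestComponent extends PrefixedParamComponent {
        @Override protected String paramPrefix() { return "autoSuggest"; }
    }

    static class SpellCheckStyleComponent extends PrefixedParamComponent {
        @Override protected String paramPrefix() { return "spellcheck"; }
    }

    public static void main(String[] args) {
        System.out.println(new AutoSuggestComponent().param("count"));     // prints autoSuggest.count
        System.out.println(new SpellCheckStyleComponent().param("build")); // prints spellcheck.build
    }
}
```

          The shared logic lives once in the base class, and each concrete component only declares its prefix.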

          David Smiley added a comment -

          For auto-complete, I use Solr's faceting via facet.prefix, as described in the Solr book that I authored. How does this approach differ from that? I suspect that the use of radix trees and whatnot as described here will result in lower memory (RAM) requirements. It would be interesting to see rough RAM requirements for the facet.prefix approach based on the number of distinct terms... though only a fraction of the term space ends up being auto-completed.

          Yonik Seeley added a comment -

          Where are we on this - do people feel it's ready to commit?
          We probably want to add some unit tests too, and some documentation on the wiki at some point.

          AFAIK, we're limited to one spellcheck component per request handler - that should be OK though, since presumably this is meant to be used on its own, right? What is the recommended/default configuration? We should probably add it as a /autocomplete handler in the example server.

          Does this currently work with phrases?

          Shalin Shekhar Mangar added a comment -

          Where are we on this - do people feel it's ready to commit?

          It has been some time since I looked at it, but I don't feel it is ready. Using it through spellcheck works, but specifying spellcheck params feels odd. Also, I don't know how well it compares to the regular TermsComponent or facet.prefix searches in terms of memory and CPU cost.

          Jan Høydahl added a comment -

          Found this excellent article on usability of auto suggest: http://blog.twigkit.com/search-suggestions-part-1/

          A bit more talk about use cases could help us make sure that we're solving the right problem.

          One complex use case could be:

          • First display at most one suggestion if some terms are misspelled.
          • Then display 0-3 suggestions from the categories field, ordered by most docs in that category
          • Then display 0-10 generic suggestions based on fields "title,keywords" ordered by relevancy and tf/idf

          Yet another use case (for shopping sites) is:

          • First display 0-3 actual products if there is an exact match on the product name. These suggestions are of type "instant result" and must return to the frontend all data necessary to display a preview of the instant result. Clicking this suggestion takes the user directly to the product page.
          • Then display 0-10 suggestions from an index (separate Solr core?) containing actual user queries, with offensive content filtered out, sorted by relevancy and boosted by frequency.

          One way to back the suggester from the user query log is through a full-blown Solr core, where in addition to the terms you also index metadata such as usage frequency. Your auto-suggest query could then benefit from all of Solr's power in ranking etc.

          Do complex scenarios like this belong within one request to one instance of the autosuggest component? That would be very powerful. Or is it better to require users to build a proxy between the frontend and Solr which issues multiple requests to multiple autosuggest URLs and/or queries to ordinary indices?

          Andrzej Bialecki added a comment -

          I would lean towards the latter - complex do-it-all components often suffer from creeping featuritis and insufficient testing/maintenance, because there are few users that use all their features, and few developers that understand how they work. I subscribe to the Unix philosophy - do one thing, and do it right, so I think that if we can implement autosuggest that works well from the technical POV, then it will become a reliable component that you can combine in many creative ways to satisfy different scenarios, of which there are likely many more than what you described ...

          Grant Ingersoll added a comment -

          What's the status on this?

          jonas stock added a comment -

          Hello.
          I have some trouble applying this patch.
          Can anyone send me a little how-to for Windows?

          Andrzej Bialecki added a comment -

          Updated patch. This version of the patch is relative to trunk/. It supports onlyMorePopular, and includes a unit test with a small benchmark.

          Example results for 100,000 keys in the dictionary - size is in bytes, times are totals for all 100k lookups.

          JaspellLookup:	size=81078997	buildTime=1347	lookupTime=1431
          TSTLookup:	size=53453696	buildTime=316	lookupTime=1453
          
          Grant Ingersoll added a comment -

          And what are the units for the time? I assume ms. Do you have any tests for the "average" case, i.e. a lookup of the top 5-10 suggestions for a given prefix?

          Andrzej Bialecki added a comment -

          Yes, sorry, it's [ms] of elapsed time for adding 100k strings or looking up 100k strings, though I admit I was lazy and used the same strings for lookup as I did for the build phase ... I'll change this so that it properly looks up prefixes and I'll post new results.

          Andrzej Bialecki added a comment -

          Updated patch - previous version had a missing "else" in TSTLookup.

          Improved benchmark - the earlier results are useless because the running time was too short. Also, this time the lookup keys are strict prefixes between 1/3 and 1/2 of the key length. Result counts are cross-validated between the JaSpell and TST implementations.
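          The prefix-selection scheme just described (strict prefixes between 1/3 and 1/2 of the key length) can be sketched as follows; `prefix` is an illustrative helper, not code from the patch.

```java
import java.util.Random;

public class PrefixPicker {
    // Illustrative helper for the benchmark scheme described above: for each
    // dictionary key, take a strict prefix whose length lies between 1/3 and
    // 1/2 of the key length (clamped to at least one character).
    static String prefix(String key, Random rnd) {
        int min = Math.max(1, key.length() / 3);
        int max = Math.max(min, key.length() / 2);
        return key.substring(0, min + rnd.nextInt(max - min + 1));
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        for (String key : new String[] {"suggest", "suggestions", "autocomplete"}) {
            System.out.println(key + " -> " + prefix(key, rnd));
        }
    }
}
```

          Each lookup then probes the tree with a proper prefix rather than the full key, which better matches the autosuggest access pattern.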

          Here are the new timings (times in [ms] for 100k items):

          JaspellLookup:	buildTime=448	lookupTime=1316
          JaspellLookup:	buildTime=379	lookupTime=1073
          JaspellLookup:	buildTime=399	lookupTime=709
          JaspellLookup:	buildTime=405	lookupTime=698
          JaspellLookup:	buildTime=454	lookupTime=758
          JaspellLookup:	buildTime=451	lookupTime=746
          JaspellLookup:	buildTime=436	lookupTime=886
          JaspellLookup:	buildTime=424	lookupTime=696
          JaspellLookup:	buildTime=402	lookupTime=697
          JaspellLookup:	buildTime=415	lookupTime=1156
          JaspellLookup:	buildTime=413	lookupTime=693
          JaspellLookup:	buildTime=429	lookupTime=698
          JaspellLookup:	buildTime=411	lookupTime=885
          JaspellLookup:	buildTime=402	lookupTime=688
          JaspellLookup:	buildTime=398	lookupTime=691
          JaspellLookup:	buildTime=405	lookupTime=1152
          JaspellLookup:	buildTime=405	lookupTime=695
          JaspellLookup:	buildTime=410	lookupTime=1009
          JaspellLookup:	buildTime=409	lookupTime=891
          JaspellLookup:	buildTime=400	lookupTime=685
          TSTLookup:	buildTime=185	lookupTime=289
          TSTLookup:	buildTime=161	lookupTime=427
          TSTLookup:	buildTime=173	lookupTime=311
          TSTLookup:	buildTime=183	lookupTime=304
          TSTLookup:	buildTime=177	lookupTime=311
          TSTLookup:	buildTime=175	lookupTime=287
          TSTLookup:	buildTime=173	lookupTime=431
          TSTLookup:	buildTime=161	lookupTime=278
          TSTLookup:	buildTime=161	lookupTime=282
          TSTLookup:	buildTime=177	lookupTime=453
          TSTLookup:	buildTime=157	lookupTime=286
          TSTLookup:	buildTime=160	lookupTime=432
          TSTLookup:	buildTime=161	lookupTime=281
          TSTLookup:	buildTime=160	lookupTime=275
          TSTLookup:	buildTime=160	lookupTime=454
          TSTLookup:	buildTime=178	lookupTime=298
          TSTLookup:	buildTime=181	lookupTime=289
          TSTLookup:	buildTime=159	lookupTime=432
          TSTLookup:	buildTime=164	lookupTime=285
          TSTLookup:	buildTime=159	lookupTime=480
          
          Hoss Man added a comment -

          Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

          Selection criteria was "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. email notifications were suppressed.

          A unique token for finding these 240 issues in the future: hossversioncleanup20100527

          Ankul Garg added a comment -

          Are we currently using the UnsortedTermFreqIteratorWrapper to build a randomly balanced ternary search tree? How about building a balanced tree using a SortedIterator, in a manner similar to the balancedTree() function that I have implemented in TSTAutocomplete? That may further improve the results. What do you say?

          Andrzej Bialecki added a comment -

          Are we currently using the UnsortedTermFreqIteratorWrapper to build a randomly balanced ternary search tree?

          Yes.

          How about building a balanced tree using a SortedIterator, in a manner similar to the balancedTree() function that I have implemented in TSTAutocomplete? That may further improve the results. What do you say?

          Indeed, UnsortedTFIW buffers all input strings anyway, so we might as well do it this way. I'll post results shortly.

          Ankul Garg added a comment -

          Here are the results comparing TSTLookup for UnsortedTFIT and SortedTFIT:

          For UnsortedTFIT

          TSTLookup: buildTime=384 lookupTime=418
          TSTLookup: buildTime=272 lookupTime=414
          TSTLookup: buildTime=402 lookupTime=611
          TSTLookup: buildTime=383 lookupTime=372
          TSTLookup: buildTime=421 lookupTime=569
          TSTLookup: buildTime=317 lookupTime=519
          TSTLookup: buildTime=388 lookupTime=588
          TSTLookup: buildTime=269 lookupTime=508
          TSTLookup: buildTime=267 lookupTime=454
          TSTLookup: buildTime=380 lookupTime=613
          TSTLookup: buildTime=312 lookupTime=360
          TSTLookup: buildTime=279 lookupTime=536
          TSTLookup: buildTime=337 lookupTime=368
          TSTLookup: buildTime=356 lookupTime=368
          TSTLookup: buildTime=343 lookupTime=517
          TSTLookup: buildTime=383 lookupTime=541
          TSTLookup: buildTime=277 lookupTime=590
          TSTLookup: buildTime=300 lookupTime=720
          TSTLookup: buildTime=334 lookupTime=464
          TSTLookup: buildTime=381 lookupTime=613

          For SortedTFIT
          TSTLookup: buildTime=532 lookupTime=382
          TSTLookup: buildTime=507 lookupTime=367
          TSTLookup: buildTime=535 lookupTime=433
          TSTLookup: buildTime=377 lookupTime=365
          TSTLookup: buildTime=513 lookupTime=364
          TSTLookup: buildTime=355 lookupTime=419
          TSTLookup: buildTime=359 lookupTime=418
          TSTLookup: buildTime=371 lookupTime=362
          TSTLookup: buildTime=475 lookupTime=461
          TSTLookup: buildTime=371 lookupTime=422
          TSTLookup: buildTime=360 lookupTime=362
          TSTLookup: buildTime=357 lookupTime=421
          TSTLookup: buildTime=511 lookupTime=419
          TSTLookup: buildTime=482 lookupTime=360
          TSTLookup: buildTime=532 lookupTime=430
          TSTLookup: buildTime=465 lookupTime=458
          TSTLookup: buildTime=534 lookupTime=359
          TSTLookup: buildTime=358 lookupTime=420
          TSTLookup: buildTime=359 lookupTime=418
          TSTLookup: buildTime=356 lookupTime=361

          The lookup time in the second case seems to have been reduced further, at a slight cost in build time.

          Ankul Garg added a comment -

          Andrzej, waiting for your results. After that we can finalize the more appropriate method and plan out the work that needs to be done further (if any).

          Andrzej Bialecki added a comment -

          I modified the benchmark slightly - it takes much longer to execute, but results are more stable across runs.

          TSTLookup - size=53453696
          JaspellLookup - size=81078997
          * Running 100 iterations for JaspellLookup ...
            - warm-up 10 iterations...
            - main iterations: 10 20 30 40 50 60 70 80 90
          JaspellLookup: buildTime[ms]=412 lookupTime[ms]=1030
          * Running 100 iterations for TSTLookup ...
            - warm-up 10 iterations...
            - main iterations: 10 20 30 40 50 60 70 80 90
          TSTLookup: buildTime[ms]=405 lookupTime[ms]=288
          

          At this point I think we are making only minor incremental improvements, so if people feel this is a useful component in its current shape then I propose to commit it.

          Andrzej Bialecki added a comment -

          Just to clarify: the buildTime above is how long it takes to build a trie with 100k elements. The lookupTime is how long it takes to perform 100k lookups using short keys (i.e. returning usually many more than 1 completion).
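
          The warm-up-then-measure loop described above can be sketched like this. A hypothetical harness (names invented here), with op standing in for either the build or the lookup work:

```java
// Hypothetical sketch of the benchmark loop described above: run some
// warm-up iterations so JIT compilation settles, then time the main runs.
public class Bench {
    public static long timeMillis(Runnable op, int warmup, int iterations) {
        for (int i = 0; i < warmup; i++) op.run();       // discarded warm-up
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) op.run();   // measured runs
        return (System.nanoTime() - start) / 1_000_000L; // elapsed ms
    }
}
```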

          Andrzej Bialecki added a comment -

          Minor changes - clean up imports, improve the Suggester.reload().

          Ankul Garg added a comment -

          Another minor change in TSTLookup. It's better to initialize root to null, so that after building the tree the splitChar of the root won't be null, as is currently the case.

          Ankul Garg added a comment -

          Updated TSTAutocomplete balancedTree() too. It's fine now.

          Ankul Garg added a comment -

          How about committing the component? Comments awaited!

          David Smiley added a comment -

          (re-post)
          For auto-complete, I use Solr's faceting via facet.prefix as described in the Solr book which I authored. How does this approach differ from that? I suspect that the use of radix-trees and what-not as described here will result in less memory (RAM) requirements. It would be interesting to see rough RAM requirements for the facet prefix approach based on the number of distinct terms... though only a fraction of the term space ends up being auto-completed.

          Andrzej Bialecki added a comment -

          I would expect the facet.prefix method to consume less additional RAM dedicated to this functionality (because field caches are shared among different components), but I doubt it could beat the performance of ternary tries.

          Also, in many situations it makes sense to populate an auto-complete component from a query log, and not from the index. If you want to offer phrase-based autocomplete then using facet.prefix method you would have to create an additional field populated with shingles, and then the RAM cost would grow tremendously, whereas with TST this increase would be moderate.

          Regarding specific numbers... if you have a test setup for facet.prefix, feel free to apply this patch and test it; we'd love to see your results - see also the numbers above for 100k random strings.
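
          For readers weighing the two approaches, the core of the trie side is small. Below is a minimal ternary search tree with prefix lookup, an illustrative sketch of the data structure behind TSTLookup (class and method names invented here), not the patch's actual code:

```java
import java.util.*;

// Minimal ternary search tree: each node holds one character and three
// children (less-than, equal/next-char, greater-than).
public class TinyTST {
    static final class Node {
        char splitChar;
        Node lo, eq, hi;
        boolean isWord;
    }

    private Node root;

    public void insert(String key) { root = insert(root, key, 0); }

    private Node insert(Node n, String key, int i) {
        char c = key.charAt(i);
        if (n == null) { n = new Node(); n.splitChar = c; }
        if (c < n.splitChar)           n.lo = insert(n.lo, key, i);
        else if (c > n.splitChar)      n.hi = insert(n.hi, key, i);
        else if (i < key.length() - 1) n.eq = insert(n.eq, key, i + 1);
        else                           n.isWord = true;
        return n;
    }

    /** All stored keys beginning with the given (non-empty) prefix, in order. */
    public List<String> prefixMatch(String prefix) {
        List<String> out = new ArrayList<>();
        Node n = root;
        int i = 0;
        while (n != null) {                 // walk down to the prefix's subtree
            char c = prefix.charAt(i);
            if (c < n.splitChar)      n = n.lo;
            else if (c > n.splitChar) n = n.hi;
            else {
                if (++i == prefix.length()) break;
                n = n.eq;
            }
        }
        if (n == null) return out;          // no key starts with this prefix
        if (n.isWord) out.add(prefix);
        collect(n.eq, new StringBuilder(prefix), out);
        return out;
    }

    private void collect(Node n, StringBuilder prefix, List<String> out) {
        if (n == null) return;
        collect(n.lo, prefix, out);
        prefix.append(n.splitChar);
        if (n.isWord) out.add(prefix.toString());
        collect(n.eq, prefix, out);
        prefix.deleteCharAt(prefix.length() - 1);
        collect(n.hi, prefix, out);
    }
}
```

          A lookup walks down to the node matching the last prefix character and then collects the subtree below it, so the cost is proportional to the prefix length plus the number of completions, independent of the total number of stored keys.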

          Stefan Seidel added a comment -

          Hi. First of all, thanks for making this effort for a usable auto-suggest feature.

          I have a question which is (hopefully) relevant to all this: we tried using the facet method for autosuggest as presented in David's book about Solr. It works well and is usually fast; however, there are two issues: (a) initializing the faceting takes time (>300 sec. for 8.3 million docs here), which would be OK if it weren't that (b) after any commit (even without new documents added) the faceting needs exactly this time again. On a heavily loaded server with commits every 60 seconds, this doesn't work at all. Also, it makes Solr use >1G of RAM, and I can get it into HeapSpace errors when multiple queries are running and the faceting hasn't been initialized.

          Question: has this scenario (frequent commits) been considered a use case when implementing these new components? Will the tree get rebuilt at every commit as well? (Judging from the previous numbers this should take about 10 sec. in our scenario?) I think this is a crucial question for the whole feature.

          David Smiley added a comment -

          Stefan, just checking, are you using Solr 1.4? It contains efficiency improvements in this regard that are very important for this use case. Also, I recommend that your firstSearcher queries contain the faceting you need so that users never see this delay. I don't think this needs to go in newSearcher but I could be wrong.
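
          The firstSearcher warming David suggests looks roughly like this in solrconfig.xml. A hedged sketch: the query and the field name `name` are purely illustrative.

```xml
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="facet">true</str>
      <str name="facet.field">name</str>
    </lst>
  </arr>
</listener>
```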

          muneeb ali added a comment -

          Hi, I have a very relevant question with regards to this feature. I am a beginner with Solr, so bear with my limited knowledge. I am following David's book to implement the auto-complete feature, which uses facet.prefix to return the suggestions. I have two questions:

          1 - To make use of Solr auto-complete, would I need to query Solr every time a letter is typed in the search box?
          2 - How does the above component differ from David's method in terms of implementation (a quick demo of schema/solrconfig would be appreciated)?

          If there is any documentation on this, could anyone guide me to that please?

          Thanks in advance.
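
          Regarding question 2: the component discussed in this issue is wired up through the spellcheck machinery rather than faceting. The sketch below follows the general shape of the suggester configuration that landed here; the field name, count, and handler path are illustrative, and parameter details may differ between versions.

```xml
<searchComponent class="solr.SpellCheckComponent" name="suggest">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">name</str>
  </lst>
</searchComponent>

<requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/suggest">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.count">5</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>
```

          As for question 1: yes, the client typically issues a request to the suggest handler on each keystroke (often debounced client-side); the point of the in-memory trie is to make those per-keystroke requests cheap.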

          Andy added a comment -

          Does this handle non-prefix matches?

          For example, if user types "guit", I want to suggest both "guitar" and "electric guitar".

          Would this patch do that?
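
          The tree itself matches prefixes only, but the phrase case above ("guit" matching "electric guitar") can be approximated outside the component by storing every word-starting suffix of a phrase as a key that points back to the full phrase. An illustrative sketch (names invented here), not part of this patch:

```java
import java.util.*;

// Store a phrase under every word-starting suffix, so a prefix search
// for "guit" hits the key "guitar", which points back to "electric guitar".
public class InfixKeys {
    /** Returns the keys under which the phrase should be stored. */
    public static List<String> keysFor(String phrase) {
        List<String> keys = new ArrayList<>();
        String[] words = phrase.split(" ");
        for (int i = 0; i < words.length; i++) {
            keys.add(String.join(" ", Arrays.copyOfRange(words, i, words.length)));
        }
        return keys;
    }
}
```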

          Ankul Garg added a comment -

          Stefan, yes, the tree needs to be re-built at each commit. But I think the problem can be overcome by implementing the ternary tree to be self-balancing: inserting and deleting strings then won't affect the balance of the tree. But does the commit operation give us a new set of tokens altogether, or can we have separate sets of newly added tokens and deleted tokens? The latter would make this trick possible.

          Stefan Seidel added a comment -

          @David, yes, we're using Solr 1.4. And of course, putting the facet query in firstSearcher only alleviates the problem of several searches trashing the Solr server, because they all need to wait for the autowarming, but it still takes ~240 seconds to initialize faceting, and this is not usable with commits every 60 seconds.

          @Ankul, thanks for the analysis. It would be great if the commit changes could be merged on the fly, but what about an option to not rebuild the tree at every commit? Like the spellchecker, the autosuggest can usually drift from the main index until an optimize is done (e.g. during night time). Then we wouldn't get the most up-to-date suggestions, but this is usually not really needed.

          Ankul Garg added a comment -

          @Stefan, for not re-building the tree at each commit, we need to have a separate list of newly added tokens. Is that possible? My idea is that each time a commit is performed, we would have a list of tokens which have been newly added, and they can be inserted into the tree while still keeping it balanced. This can be done with a self-balancing ternary search tree. But first I need to confirm whether we can have any such list of newly added tokens.

          Andrzej Bialecki added a comment -

          Newly added terms will come from the latest segments, if the tree is populated from an IndexReader. Also, Solr uses IndexReader.reopen() to catch up with new data in new segments. From this it follows that as long as the IndexReader instance remains the same (i.e. it comes from reopen) if we keep track of the previous maxDoc then we can also catch up with new terms by processing only the latest segments (using IndexReader.getSequentialSubReaders() and IndexReader.getSubReaderDocBase()).

          So, the short answer to your question is "yes"
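
          The watermark idea can be modeled in plain Java as below. Only a sketch: a real implementation would walk the newest index segments through the Lucene calls named above rather than a flat list of strings, and the class and method names here are invented.

```java
import java.util.*;

// Model of the "previous maxDoc" watermark: only documents added since the
// last build contribute candidate terms, and only genuinely new terms are
// inserted into the (stand-in) tree.
public class CatchUp {
    private int lastMaxDoc = 0;
    private final Set<String> tree = new TreeSet<>(); // stand-in for the TST

    /** docs is the full, append-only list of documents (one term each). */
    public List<String> update(List<String> docs) {
        List<String> added = new ArrayList<>();
        for (int doc = lastMaxDoc; doc < docs.size(); doc++) {
            String term = docs.get(doc);
            if (tree.add(term)) added.add(term); // skip terms already present
        }
        lastMaxDoc = docs.size();                // advance the watermark
        return added;
    }
}
```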

          Ankul Garg added a comment -

          @Andrzej, I'm working on the implementation of a self-balancing TST. The idea is similar to the height balancing used in AVL trees. Hope to get back soon with the implementation.

          Andrzej Bialecki added a comment -

          Playing the catch-up game with the trunk... This patch includes a test case + resources, and fixes an error in the conversion from priority queue to lookup results.

          I would appreciate a review - if there are no objections I'd like to commit this soon.

          Andrzej Bialecki added a comment -

          Previous patch had some unrelated changes.

          Robert Muir added a comment -

          Hi Andrzej, I have a concern about Unicode support here:

          in JaspellTernarySearchTrie.compareCharsAlphabetically there is some code like:

          if (cCompare2 >= 65) {
            if (cCompare2 < 89) {
              cCompare = (2 * cCompare2) - 65;
            } else if (cCompare2 < 97) {
              cCompare = cCompare2 + 24;
            } else if (cCompare2 < 121) {
              cCompare = (2 * cCompare2) - 128;
            } else cCompare = cCompare2;
          } else cCompare = cCompare2;
          ...
          

          Could we consider a more unicode-friendly approach, such as simply comparing Character.toLowerCase(cCompare2) with Character.toLowerCase(cRef) ?

          Andrzej Bialecki added a comment -

          Yeah, this code comes from Jaspell... if we change it then it needs a unit test IMHO. I'll see what I can do.

          Robert Muir added a comment -

          Andrzej, it might not be a problem, looking at how the result is used (it seems only the sign is significant)?

          But I have trouble understanding what the point of all the complexity in this method is... if it's just as it's documented, it seems it could be much simpler (e.g. b-a). So I feel like I am missing something.

          char A | char B | result
          A      | A      |      0
          a      | a      |      0
          A      | a      |     -1
          a      | A      |      1
          Ã      | ã      |    -32
          ã      | Ã      |     32
          л      | Л      |     32
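
          The table appears to match the Jaspell snippet quoted earlier: applying its transform to both characters and subtracting reproduces every row. A direct transcription to check (the class and method names here are invented):

```java
// Direct transcription of the transform in the quoted Jaspell snippet;
// the comparison result is transform(a) - transform(b).
public class JaspellCompare {
    static int transform(char c) {
        if (c >= 65) {
            if (c < 89)       return 2 * c - 65;
            else if (c < 97)  return c + 24;
            else if (c < 121) return 2 * c - 128;
        }
        return c; // everything else passes through unchanged
    }

    static int compare(char a, char b) { return transform(a) - transform(b); }
}
```

          As the later comments note, only the sign of this result ends up mattering to the trie.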
          Grant Ingersoll added a comment -

          Is it the case that I have to build the TST (i.e. issue spellcheck.build=true) every time I start up Solr?

          Andrzej Bialecki added a comment -

          Unfortunately yes, that's the way it works at the moment. I added hooks for storing/loading tries but they are dummy placeholders for now. Contributions are welcome

          Grant Ingersoll added a comment -

          I think we should at least open an issue for it and link to this one when this one gets committed, as it takes a while to build.

          Andrzej Bialecki added a comment -

          Robert,

          But I have trouble understanding what the point of all the complexity in this method is... if it's just as it's documented, it seems it could be much simpler (e.g. b-a). So I feel like I am missing something.

          I don't get it either... it's borrowed code, after all. Anyway, I replaced this method with the following:

            private static int compareCharsAlphabetically(char cCompare2, char cRef) {
              return Character.toLowerCase(cCompare2) - Character.toLowerCase(cRef);
            }
          

          and all tests pass, including those that test for correctness of returned suggestions and consistency between Jaspell and TST. I also ran testBenchmark() and differences in timings are negligible.

          Grant,

          I think we should at least open an issue for it and link to this one when this one gets committed, as it takes a while to build.

          Yes, I'll open an issue when this gets committed.

          If there are no further objections I'd like to commit this.

          Andrzej Bialecki added a comment -

          Updated patch.

          Robert Muir added a comment -

          and all tests pass, including those that test for correctness of returned suggestions and consistency between Jaspell and TST. I also ran testBenchmark() and differences in timings are negligible.

          Thanks Andrzej!

          Andrzej Bialecki added a comment -

          Yet another attempt to catch the peloton leaders... This is the version to be committed now.

          Andrzej Bialecki added a comment -

          Committed to trunk in rev. 988120. Thanks to all who contributed to this issue!

          Lance Norskog added a comment -

          Would it make sense to add this to 3.x?

          Yonik Seeley added a comment -

          Here's a patch to the tests:

          • adds ASL header
          • removes @Override from interface methods
          • converts to Junit4
          • greatly simplifies things by using standard test utility methods and going through the front door (i.e. there's usually no reason to manually look up a request handler).
          • removes the testReload altogether... it didn't seem to test anything that wasn't tested by the first test method (i.e. that a commit will rebuild).
          Mark Miller added a comment -

          removes @Override from interface methods

          Why's that, when we are on Java 6 now?

          Yonik Seeley added a comment -

          Makes it easier to backport to 3x later if desired. Not a big deal - it's test code after all.
          And from what Uwe says... even though Java 6 allows it, it's still "incorrect".

          Grant Ingersoll added a comment -

          Backports to 3.x (includes Andrzej's final patch, plus Yonik's update). Weird thing, though: the SuggesterTest no longer passes. It suggests "accidentally" instead of the ones given. The only thing that is different is the HighFreqDictionary implementation of freq().

          Weird. Maybe I'm missing something in terms of differences in term enumeration between 3.x and trunk.

          Andrzej Bialecki added a comment -

          Grant, you missed changes to HighFrequencyDictionary#HighFrequencyIterator. Here's the patch that adds these changes, all tests pass now.

          If there are no objections I'll commit this shortly.

          Koji Sekiguchi added a comment -

          If there are no objections I'll commit this shortly.

          +1. BTW, some of newly added files in the patch are missing Apache License header.

          Andrzej Bialecki added a comment -

          I added license headers and committed the patch in rev. 993367 - thank you!

John Beck added a comment - (edited)

          Hey guys, really nice work on this, it is extremely fast! In production right now we're seeing 200-700ms latency, and with this I typically get between 1ms and 10ms.

I did find one issue, though. I'm going to use this with a dictionary of 150k medical terms, and it works, except when my query happens to start with a popular word.

          If I use this as a dictionary,

           
          Hepatitis B Viruses, Duck 
          Hepatitis B e Antigens 
          Hepatitis B virus 
          Hepatitis B, Chronic 
          Hepatitis Be Antigens 
          Hepatitis C 
          Hepatitis C Antibodies 
          Hepatitis C Antigen 
          

          And then search for Hepatitis C,

           
          curl "http://localhost:8982/solr/suggest/?spellcheck=true&spellcheck.dictionary=suggest&spellcheck.extendedResults=true&spellcheck.count=5&q=Hepatitis%20C" 
          <response> 
          <lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int></lst><str name="command">build</str><lst name="spellcheck">
          <lst name="suggestions"><lst name="Hepatitis"><int name="numFound">5</int><int name="startOffset">0</int><int name="endOffset">9</int>
          <arr name="suggestion"><str>hepatitis b e antigens</str><str>hepatitis b virus</str><str>hepatitis b viruses, duck</str><str>hepatitis b, chronic</str>
          <str>hepatitis be antigens</str></arr></lst></lst></lst> 
          </response> 
          

          You can see it never makes it to Hepatitis C since it's #6 in that dictionary, and I'm limiting the results to 5.

          When I bump spellcheck.count=6, then I get the very first Hepatitis C result but not the rest.

So there are about 2500 terms that start with "Receptor", and I don't want to have to bump the count to 3000 results. Is there anything else that can be done?

          Yonik Seeley added a comment -

          Is there any documentation for this autosuggest component? If so, I can't seem to find it.

          Mark Miller added a comment -

Yeah, with 32 watchers and 10 votes, some doc might be appreciated. It's tough to figure out how to set this up by looking through all these JIRA comments. We need some tips/warnings about how this works too: how much RAM it might take on a large index (e.g. warn that it could be a lot), how to reduce that if you need to (threshold), how the autocomplete index is not preserved over core reloads, etc.

          Andrzej Bialecki added a comment -

Thanks for prodding. I'll create a wiki page with the docs.

          Grant Ingersoll added a comment -

FWIW, much of the docs are the same as for the SpellCheckComponent. In fact, I'm planning on implementing a Related Search capability, and it occurs to me that Related Search, spelling, and autosuggest are all variants of the same thing (given a query, give me suggestions for pertinent things around that query), so I'm going to refactor things a little bit to reflect that. There shouldn't be any functionality change. The only thing I was contemplating changing in the config was renaming the <spellcheck> tag to <suggestion> (it would still accept the old one, too). Code-wise, the gist of what I have in mind is that there will be a SuggestionComponent, and SpellCheckComponent will just be an empty, deprecated shell on top of it. Maybe I would take SolrSpellChecker and rename it (or pull it up a level) as SolrSuggester. Everything should still work fine, but the nomenclature is better, I think.

          Given the above, I think the docs could then be reworked to show the various "suggestion" paths one might take: auto, spelling, related, other and how they can be implemented and reuse the same input params and output responses.

          Andrzej Bialecki added a comment -

          Sounds like a good plan. Let's track this in a separate issue - this one is bloated enough as it is. FWIW, I created http://wiki.apache.org/solr/Suggester to cover the current functionality, we can reuse it later too for the merged docs.

          Yonik Seeley added a comment -

          Thanks for the docs Andrzej! I was able to get it working using the configuration on that wiki page, after
          changing threshold to a float:
          <float name="threshold">0.005</float>
          and changing the "field" to something in the example data:
          <str name="field">name</str>

          Autosuggest is a desirable enough feature for people that we should consider adding some config for it to the example schema,
          and adding a "QuickStart" to the wiki page so that people can immediately be successful with it.

Random Q: do we have a capability that can mimic Google's suggest yet (i.e. both single and multi-word suggestions at the same time)? I assume that adding a field type that shingles may be enough? If so, we should add that to the example server (hopefully on some field that won't negatively impact performance).
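Putting the two corrections above together with the wiki configuration, a working setup might look like the sketch below. This is only a sketch: the component name, the Suggester class, and the TSTLookup class come from this issue's patch and the wiki page; `buildOnCommit` reflects the rebuild-on-commit behavior the tests exercise; the `field` value assumes the example schema and should be changed to match your data.

```xml
<searchComponent class="solr.SpellCheckComponent" name="suggest">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <!-- a field that exists in the example data, per Yonik's note -->
    <str name="field">name</str>
    <!-- must be a float, not an int -->
    <float name="threshold">0.005</float>
    <!-- rebuild the lookup structure on commit -->
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>
```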

          Grant Ingersoll added a comment -

          It's baked into the VelocityRespWriter example at /solr/browse (although I think it currently uses browse).

          FWIW, I think we should make /solr/browse the default example/tutorial. It is really great for getting up and going and showing people a real search user interface.

          Silvan Zurbruegg added a comment -

Is there a possibility to have minimal query logic when looking up suggestions with the suggester?
AFAIK, all syntax like 'AND' or 'OR' is stripped by the QueryConverter. From playing around with the suggester I figured out that looking up multiple terms is possible, resulting in an 'OR'-like list for each term. Is it possible to have support for 'AND' logic as well? For example, looking up the terms '+css' and '+web' would yield suggestions from documents that contain both terms. The first thing coming to mind is TermVectors, but since term vectors are document-based and probably not related to the suggester dictionary, this could get complicated and is probably out of the scope of the suggester component.

          Abhay Dabholkar added a comment -

I have a file which has words and phrases in it.

I was wondering how to make the following possible.

          file has
          -------------
          rebate form
          form

When I look for "form" or even "for", I would like "rebate form" to be included too.
I tried using
<str name="lookupImpl">org.apache.solr.spelling.suggest.jaspell.JaspellLookup</str>
but no luck; the wiki suggests a one-liner change to get fuzzy suggestions.

How would I configure this? And if changes need to be made to JaspellLookup, what would they be?
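For reference, a file-based setup along these lines might look like the sketch below. This is an assumption-laden sketch, not a confirmed answer: the `sourceLocation` parameter is the file-dictionary option described on the Suggester wiki page, and the filename is hypothetical; whether JaspellLookup returns fuzzy matches out of the box is exactly the open question above.

```xml
<searchComponent class="solr.SpellCheckComponent" name="suggest">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.jaspell.JaspellLookup</str>
    <!-- plain text file, one term or phrase per line (hypothetical filename) -->
    <str name="sourceLocation">suggest-dictionary.txt</str>
  </lst>
</searchComponent>
```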

          Andrzej Bialecki added a comment -

          This issue is closed. Please discuss the functionality on the mailing lists, or if there are bugs / missing features then please create a new JIRA issue.

          Robert Muir added a comment -

          Andrzej, could you review the files added by this commit, and see if any need Apache 2 license headers?
          From the comments, it seems some of it comes from jaspell, so I am hesitant to apply any headers...

          [rat:report]   C:/Users/rmuir/workspace/lucene/solr/src/java/org/apache/solr/spelling/suggest/BufferingTermFreqIteratorWrapper.java
          [rat:report]   C:/Users/rmuir/workspace/lucene/solr/src/java/org/apache/solr/spelling/suggest/Lookup.java
          [rat:report]   C:/Users/rmuir/workspace/lucene/solr/src/java/org/apache/solr/spelling/suggest/SortedTermFreqIteratorWrapper.java
          [rat:report]   C:/Users/rmuir/workspace/lucene/solr/src/java/org/apache/solr/spelling/suggest/UnsortedTermFreqIteratorWrapper.java
          [rat:report]   C:/Users/rmuir/workspace/lucene/solr/src/java/org/apache/solr/spelling/suggest/jaspell/JaspellLookup.java
          [rat:report]   C:/Users/rmuir/workspace/lucene/solr/src/java/org/apache/solr/spelling/suggest/jaspell/JaspellTernarySearchTrie.java
          [rat:report]   C:/Users/rmuir/workspace/lucene/solr/src/java/org/apache/solr/spelling/suggest/tst/TSTAutocomplete.java
          [rat:report]   C:/Users/rmuir/workspace/lucene/solr/src/java/org/apache/solr/spelling/suggest/tst/TSTLookup.java
          [rat:report]   C:/Users/rmuir/workspace/lucene/solr/src/java/org/apache/solr/spelling/suggest/tst/TernaryTreeNode.java
          
          Robert Muir added a comment -

          Also, some files in solr.util, too:

          [rat:report]   C:/Users/rmuir/workspace/lucene/solr/src/java/org/apache/solr/util/SortedIterator.java
          [rat:report]   C:/Users/rmuir/workspace/lucene/solr/src/java/org/apache/solr/util/TermFreqIterator.java
          
          David Smiley added a comment -

          Hoss, you inadvertently renamed the title from "Create autosuggest component" to the number 3. Please put it back.

          Uwe Schindler added a comment -

          Bulk close after release of 3.1


  People

    • Assignee: Andrzej Bialecki
    • Reporter: Jason Rutherglen
    • Votes: 10
    • Watchers: 33

  Time Tracking

    • Original Estimate: 96h
    • Remaining Estimate: 96h
    • Time Spent: Not Specified