SOLR-572: Spell Checker as a Search Component

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 1.3
    • Component/s: spellchecker
    • Labels:
      None

      Description

      http://wiki.apache.org/solr/SpellCheckComponent

      Expose the Lucene contrib SpellChecker as a Search Component. Provide the following features:

      • Allow creating a spell index on a given field and make it possible to have multiple spell indices – one for each field
      • Give suggestions on a per-field basis
      • Given a multi-word query, give only one consistent suggestion
      • Process the query with the same analyzer specified for the source field and process each token separately
      • Allow the user to specify minimum length for a token (optional)

      Consistency criteria for a multi-word query can consist of the following:

      • Preserve the correct words in the original query as it is
      • Never give duplicate words in a suggestion
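As a rough sketch of those consistency rules (hypothetical code, not part of any attached patch; `correctWords` and `suggestionsPerToken` are assumed inputs), the merge step for a multi-word query might look like:

```java
import java.util.*;

public class ConsistentSuggestion {
  /**
   * Builds one consistent suggestion for a multi-word query:
   * correctly spelled tokens are kept as-is, misspelled tokens are
   * replaced by their top suggestion, and duplicate words are skipped.
   */
  public static String merge(List<String> queryTokens,
                             Set<String> correctWords,
                             Map<String, List<String>> suggestionsPerToken) {
    List<String> out = new ArrayList<String>();
    Set<String> seen = new HashSet<String>();
    for (String token : queryTokens) {
      String word = token;
      if (!correctWords.contains(token)) {
        List<String> cands = suggestionsPerToken.get(token);
        if (cands != null && !cands.isEmpty()) {
          word = cands.get(0);   // take the best suggestion for this token
        }
      }
      if (seen.add(word)) {      // never emit duplicate words
        out.add(word);
      }
    }
    return String.join(" ", out);
  }
}
```

For a query like "toyata car" where "car" is correct, this would keep "car" untouched and swap only the misspelled token.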
      1. SOLR-572.patch
        11 kB
        Shalin Shekhar Mangar
      2. SOLR-572.patch
        12 kB
        Bojan Smid
      3. SOLR-572.patch
        15 kB
        Shalin Shekhar Mangar
      4. SOLR-572.patch
        17 kB
        Bojan Smid
      5. SOLR-572.patch
        23 kB
        Shalin Shekhar Mangar
      6. SOLR-572.patch
        34 kB
        Grant Ingersoll
      7. SOLR-572.patch
        34 kB
        Grant Ingersoll
      8. SOLR-572.patch
        46 kB
        Grant Ingersoll
      9. SOLR-572.patch
        53 kB
        Grant Ingersoll
      10. SOLR-572.patch
        56 kB
        Shalin Shekhar Mangar
      11. SOLR-572.patch
        61 kB
        Grant Ingersoll
      12. SOLR-572.patch
        71 kB
        Grant Ingersoll
      13. SOLR-572.patch
        71 kB
        Grant Ingersoll
      14. SOLR-572.patch
        71 kB
        Shalin Shekhar Mangar
      15. SOLR-572.patch
        72 kB
        Shalin Shekhar Mangar
      16. SOLR-572.patch
        76 kB
        Grant Ingersoll
      17. SOLR-572.patch
        80 kB
        Grant Ingersoll
      18. SOLR-572.patch
        79 kB
        Grant Ingersoll
      19. SOLR-572.patch
        79 kB
        Grant Ingersoll
      20. SOLR-572.patch
        84 kB
        Grant Ingersoll
      21. SOLR-572.patch
        84 kB
        Grant Ingersoll
      22. SOLR-572.patch
        84 kB
        Grant Ingersoll
      23. SOLR-572.patch
        88 kB
        Grant Ingersoll
      24. SOLR-572.patch
        91 kB
        Grant Ingersoll
      25. SOLR-572.patch
        92 kB
        Grant Ingersoll
      26. SOLR-572.patch
        92 kB
        Shalin Shekhar Mangar
      27. solr-572.patch
        3 kB
        Bojan Smid

        Issue Links

          Activity

          Shalin Shekhar Mangar added a comment -

          Linked to SOLR-507 - Spell Checking Improvements.

          Shalin Shekhar Mangar added a comment -

          A first cut for this issue. Please consider this as work in progress. I've posted this to get feedback on the approach and syntax.

          The patch contains the following:

          • SpellCheckComponent is an implementation of SearchComponent
          • The configuration is specified in solrconfig.xml with multiple "dictionary" nodes. Each dictionary must have a name and a type. The name must be specified during query time. The type is needed to allow for more than one way of loading data into the spell index (solr field or file). For example:
            <searchComponent name="spellcheck" class="org.apache.solr.handler.component.SpellCheckComponent">
            	<lst name="dictionary">
            		<str name="name">default</str>
            		<str name="type">solr</str>
            		<str name="field">word</str>
            		<str name="indexDir">c:/temp/spellindex</str>
            	</lst>
            	<lst name="dictionary">
            		<str name="name">external</str>
            		<str name="type">file</str>
            		<str name="path">spellings.txt</str>
            	</lst>
            </searchComponent>
            
          • If indexDir is not present in the dictionary's configuration, a RAMDirectory is used; otherwise an FSDirectory is used.
          • This patch supports dictionaries loaded from Solr fields.
          • A separate Lucene SpellChecker is created for each configured dictionary
          • Sample query syntax is as follows:
            • /select/?q=aura&version=2.2&start=0&rows=10&indent=on&spellcheck=true&spellcheck.dictionary=default&spellcheck.count=10
            • /select/?q=toyata&version=2.2&start=0&rows=10&indent=on&spellcheck=true&spellcheck.dictionary=default
          • The value for "q" is analyzed with the Solr field's query analyzer. Suggestions for each token are fetched separately.
          • By default, only one suggestion is given for the query; this mode should be used for multi-token queries.
          • If spellcheck.count is specified, the response contains at most spellcheck.count suggestions for each token separately.
          • Only unique words are returned in the suggestions.
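The per-token flow described above could be sketched as follows (an illustration only: whitespace splitting stands in for the field's query analyzer, and the `lookup` map stands in for the spell index):

```java
import java.util.*;

public class PerTokenSuggester {
  /**
   * For each token of the query, returns at most `count` unique
   * suggestions, mirroring the spellcheck.count behaviour described
   * in the patch. `lookup` is a stand-in for the spell index.
   */
  public static Map<String, List<String>> suggest(String q, int count,
                                                  Map<String, List<String>> lookup) {
    Map<String, List<String>> result = new LinkedHashMap<String, List<String>>();
    for (String token : q.trim().split("\\s+")) {
      List<String> raw = lookup.getOrDefault(token, Collections.<String>emptyList());
      List<String> unique = new ArrayList<String>();
      for (String s : raw) {
        if (!unique.contains(s)) unique.add(s);  // only unique words
        if (unique.size() == count) break;       // honour spellcheck.count
      }
      result.put(token, unique);
    }
    return result;
  }
}
```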

          Things to be done:

          • Add JUnit tests
          • Reloading dictionaries. Currently the dictionary is loaded only once during the first request.
          • Make things more configurable like SpellCheckerRequestHandler
          • Add support for onlyMorePopular flag as in SpellCheckerRequestHandler
          Noble Paul added a comment -
          • the spellcheck.dictionary=default must be optional in query. The user must be able to name a dictionary as 'default' and that can be used as the default if no value is passed.
          Otis Gospodnetic added a comment -

          I had a quick look and it all looks nice and clean.
          I like the config, though I think "solr" is too specific - the source field could be in a vanilla Lucene index that lives somewhere on disk, for example. Thus, I'd change "solr" to "index". Oh, I see, you are reading field values from the index of the current core. I think that is fine, but wouldn't it also be good to be able to read field values from a vanilla Lucene index? (but you wouldn't know the field type and thus would not be able to get the Analyzer for the field)

          Also, and regardless of the above, instead of having "indexDir" and "path", why not call them both "location" and maybe even let them include the file: scheme for consistency, if it works with the code that uses those locations?

          Also on TODO:

          • Read dictionary from plain-text files.
          Shalin Shekhar Mangar added a comment -

          Otis, I agree that we should call it "index" instead of "solr" for the type, and "path" can be renamed to "location". But indexDir refers to the target for the spell check index whereas "path" currently refers to the source of the dictionary, so IMHO we should keep "indexDir" as it is (it can also be a relative path).

          For supporting arbitrary Lucene indices, the user must specify type="index", field="fieldName", location="path/to/lucene/index/directory", which should be enough (TODO). In that case the analyzer can be fixed as something (say WhitespaceAnalyzer or StandardAnalyzer).

          I'm not sure I understand your comment on the scheme. If this is for text files, then I was thinking more about having a text file with one word per line, and all those words would go into the same dictionary.

          Otis Gospodnetic added a comment -

          I see (indexDir comment). Might be better to make it more obvious then - "sourceIndex" for the Lucene index that serves as the source of data vs. "targetIndex" (or "spellcheckerIndex") for the resulting spellchecker index.

          For Lucene indices to be used as sources of data type="index", field="fieldName", location="path/to/lucene/index/directory" makes sense.

          Ignore my comment about the scheme; I'm just complicating things with that. Yes, one word per line for plain-text file data sources - that can easily be digested with the PlainTextDictionary class (part of the Lucene SC).

          Bojan Smid added a comment -

          I added support for file-based dictionaries (they are configured as described in Shalin's post) using Lucene's PlainTextDictionary.

          However, I had to add the property "field" to the configuration for this dictionary in order to obtain an analyzer (which is passed to FieldSpellChecker). This analyzer is later used to extract tokens from the query.

          I guess my current solution is not quite correct (since PlainTextDictionary doesn't really need an analyzer), but it also makes me wonder whether, in the case of a dictionary built from the Solr index, the same analyzer should be used when building the dictionary and when parsing query strings?

          Noble Paul added a comment - - edited

          Adding a 'field' attribute is not intuitive. If your data needs custom analyzers, create an extra 'type' in the schema and let us add an extra attribute 'dataType', e.g.:

          <str name="dataType">my_new_data_type</str>
          
          Grant Ingersoll added a comment -

          Patch applies cleanly. Very cool that we finally have something concrete.

          Some thoughts:
          1. I don't believe we use author tags (is this a Solr policy? I know it is a Lucene Java convention)
          2. There needs to be unit tests
          3. I think it makes sense to have the option to return extended results
          4. I don't think it should be a default search component, but will defer to others.
          5. numFound should be returned when count > 1 as well, right? In other words, the response structure should be the same no matter what, as in:

          if (count > 1) {
            response.add("suggestions", spellChecker.getSuggestions(q, count));
          } else {
            NamedList suggestions = new NamedList();
            suggestions.add("numFound", 1);
            suggestions.add(q, spellChecker.getSuggestion(q));
            response.add("suggestions", suggestions);
          }
          

          That way it can be handled uniformly on the client

          Bojan Smid added a comment -

          The "field" attribute for the file-based dictionary is basically the same "field" attribute as in the default dictionary (in both cases it is used to obtain the query analyzer), so that is the reason why I used the same name. My question was: is it OK for the default dictionary to use the same field both to build the dictionary from the Solr index and to obtain the query analyzer for extracting tokens?

          Shalin Shekhar Mangar added a comment -

          Bojan – Thanks for adding this functionality. I'll work on making things more configurable like SCRH and add a few tests. I think it is OK and may even be needed for a few cases. Though I prefer Noble's suggestion on having fieldType instead of field since it gives more freedom to the user.

          Grant – Thanks for looking into the patch. My comments below:

          1. Right, those were generated by my IDE, I'll remove it in the next patch
          2. Agree
          3. Agree, both 2 and 3 are on my todo list
          4. I don't understand what you mean by "defer to others" but on making this default or not, I'm fine either way.
          5. Actually, the spellChecker.getSuggestion(q, count) returns a complete named list, which already has the numFound element. If you don't specify the count, then it gives back only a String for which we need to create a NamedList ourselves. In other words, the response format is actually the same both ways.

          Noble – I like your suggestion of keeping a fieldType attribute in the configuration for non-Solr dictionaries. We can use the query analyzer defined for the given fieldType in Solr's schema. If this attribute is not present, we can default to WhitespaceAnalyzer or StandardAnalyzer.

          Grant Ingersoll added a comment -

          I don't understand what you mean by "defer to others" but on making this default or not, I'm fine either way.

          Just meaning, I'm not the only one who has a say in whether or not it is a default component. My guess is not everyone will want it in the default list of components.

          Very cool on the other stuff.

          One other thing to think about: What if we want a different underlying spell checker? The Lucene spell checker approach isn't exactly state of the art as far as I understand it. Obviously not your concern at the moment, but might be good to think about the ability to interchange the underlying implementation by abstracting the notion of spelling a bit while still maintaining the same search component interface.
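A minimal sketch of such an abstraction (all names here are illustrative, not actual Solr or Lucene classes), with a toy edit-distance implementation standing in for the Lucene spell checker:

```java
import java.util.List;

/** Hypothetical interface decoupling the search component from the
 *  underlying spell checker implementation. */
interface SpellingChecker {
  List<String> getSuggestions(String token, int count);
}

/** A trivial implementation backed by a fixed word list, returning
 *  words within a small edit distance of the input. */
class EditDistanceChecker implements SpellingChecker {
  private final List<String> words;
  EditDistanceChecker(List<String> words) { this.words = words; }

  public List<String> getSuggestions(String token, int count) {
    List<String> out = new java.util.ArrayList<String>();
    for (String w : words) {
      if (out.size() == count) break;
      if (!w.equals(token) && editDistance(token, w) <= 2) out.add(w);
    }
    return out;
  }

  // classic dynamic-programming Levenshtein distance
  static int editDistance(String a, String b) {
    int[][] d = new int[a.length() + 1][b.length() + 1];
    for (int i = 0; i <= a.length(); i++) d[i][0] = i;
    for (int j = 0; j <= b.length(); j++) d[0][j] = j;
    for (int i = 1; i <= a.length(); i++)
      for (int j = 1; j <= b.length(); j++)
        d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                           d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
    return d[a.length()][b.length()];
  }
}
```

The search component would then depend only on the interface, so the Lucene-based checker could be swapped for another implementation without changing the component's request handling.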

          Otis Gospodnetic added a comment -

          Grant - I agree it would be nice. But let's get this one in first. Perhaps you can add that idea to the list in SOLR-507.

          Shalin Shekhar Mangar added a comment -

          Grant - I was trying to implement the onlyMorePopular and extendedResults format of SCRH when I realized that supporting such a response is not possible for text file based dictionaries in the current implementation. Currently, we use Lucene's PlainTextDictionary to load such text files and we don't maintain any frequency information. What do you suggest?

          Bojan/Otis - The terms loaded from the text files are passed on to Lucene's SpellChecker as-is. As per Noble's suggestion, I've added support for an optional fieldType attribute (this type must be defined in schema.xml). This type's query analyzer is used for queries. Wouldn't it be more consistent to apply the index-time analyzer during indexing as well?

          Both the above problems can be solved if we keep the words loaded from the text files in a Lucene index but I'm not sure if we want to go that way.

          Otis Gospodnetic added a comment -

          Shalin –
          I think the onlyMorePopular and extendedResults should be optional, so in case of plain text dictionaries this information would just not be present if we cannot derive it. Even if we take words from plain text files and index them into a Lucene index their frequency will remain 1.

          Does the index-time analyzer make sense? I don't have the sources handy, but doesn't Lucene SC take the input word and chop it up into 2- and 3-grams before indexing? If so, how would index-time analyzer come into play?

          In principle, if taking plain text files and indexing the words in them into a Lucene SC index solves these problems, I think that's acceptable - such indices are likely to be relatively small, so they should be quick to build and not require a lot of memory.
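For reference, the gram generation Otis describes can be illustrated roughly like this (not Lucene's actual code; the Lucene SpellChecker also indexes start/end grams and varies the gram size with word length):

```java
import java.util.*;

public class NGrams {
  /** Splits a word into character n-grams of the given size,
   *  roughly what the Lucene spell checker indexes per word. */
  public static List<String> grams(String word, int n) {
    List<String> out = new ArrayList<String>();
    for (int i = 0; i + n <= word.length(); i++) {
      out.add(word.substring(i, i + n));
    }
    return out;
  }
}
```

Because the grams are built from the word as given, any lowercasing or stemming has to happen before this step, which is where an index-time analyzer would come into play.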

          Shalin Shekhar Mangar added a comment -

          Ok, onlyMorePopular and extendedResults will only be supported for dictionaries built from Solr fields.

          Yes, the Lucene SpellChecker does create n-grams but think about lowercasing, stemming etc. All this analysis can potentially change the word which eventually gets n-grammed by Lucene.

          Bojan Smid added a comment -

          I would like to add support for different character encodings in file-based dictionaries (the current implementation takes the system's default settings). I'm not sure how we'll synchronize your work with my fix. Can you let me know when/how I can start my work?

          Shalin Shekhar Mangar added a comment -

          A new patch containing the following changes:

          1. type="solr" is now known as type="index"
          2. path is now called location
          3. Relative paths are supported. They are loaded through SolrResourceLoader.openResource method.
          4. Dictionaries can be built on arbitrary Lucene indices
          5. indexDir is now called spellcheckIndexDir to clearly highlight its purpose
          6. Dictionaries loaded from a text file can have a fieldType attribute. The analyzer of this fieldType is used at query time. If no fieldType is specified then WhitespaceAnalyzer is used.
          7. For dictionaries loaded from a text file, if fieldType is specified then index-time analysis is done using the given fieldType's analyzer
          Shalin Shekhar Mangar added a comment -

          Bojan – I don't want to hold you up so I've uploaded the current state of my work. Please go ahead with your changes. I can continue after you're done.

          Another issue I noticed with the SCRH is that it accepts the accuracy as a request parameter and calls Lucene SpellChecker.setAccuracy before getting the suggestion. However, this is neither thread-safe nor can we guarantee that the accuracy is actually enforced for the suggestion. Therefore, I think we should only have accuracy configurable in the solrconfig.xml and not as a request parameter.

          Bojan Smid added a comment - - edited

          Character encodings for file-based dictionaries now supported with property characterEncoding. So, configuration for such dictionary would look like this:

          <lst name="dictionary">
          		<str name="name">external</str>
          		<str name="type">file</str>
          		<str name="sourceLocation">spellings.txt</str>
          		<str name="characterEncoding">UTF-8</str>
          		<str name="spellcheckIndexDir">c:\spellchecker</str>
          </lst>
          

          New code needs latest lucene-spellchecker-2.4*.jar from Lucene trunk.

          Since the SolrResourceLoader.getLines method doesn't support configurable encodings (it treats everything as UTF-8), I wasn't sure how to add that support. I could have added an overloaded method to SolrResourceLoader, but there is a TODO comment there, so I decided to create a getLines() method inside the SpellCheckComponent class instead. What do you think of this?
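A minimal sketch of such an encoding-aware getLines (plain java.io; illustrative only, since the actual method would read from the stream returned by SolrResourceLoader.openResource):

```java
import java.io.*;
import java.util.*;

public class DictionaryLines {
  /** Reads one dictionary word per line from the stream using the
   *  given character encoding instead of the platform default. */
  public static List<String> getLines(InputStream in, String encoding)
      throws IOException {
    List<String> lines = new ArrayList<String>();
    BufferedReader reader = new BufferedReader(new InputStreamReader(in, encoding));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        line = line.trim();
        if (line.length() > 0) lines.add(line);  // skip blank lines
      }
    } finally {
      reader.close();
    }
    return lines;
  }
}
```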

          Oleg Gnatovskiy added a comment - - edited

Hey guys, I was just wondering if there is a way to get the suggestions not to echo the query if there are no suggestions available. For example, a query where q=food probably should not return a suggestion of "food".

          Shalin Shekhar Mangar added a comment -

          Oleg – Thanks for trying out the patch. No, currently it does not signal if suggestions are not found, it just returns the query terms themselves. I'll add that feature.

          Oleg Gnatovskiy added a comment -

          Hey guys, I am having trouble creating a file-based dictionary.

          The file looks like this:

          american
          mexican
          clothes
          shoes

          and it is in my solr.home/conf directory.

The solrConfig has the following:

<searchComponent name="spellcheck" class="org.apache.solr.handler.component.SpellCheckComponent">
  <lst name="dictionary">
    <str name="name">external</str>
    <str name="type">file</str>
    <str name="sourceLocation">spellings.txt</str>
    <str name="characterEncoding">UTF-8</str>
    <str name="spellcheckIndexDir">/home/csweb/index</str>
  </lst>
</searchComponent>

          I hit it with the following URL: http://localhost:8983/solr/select/?q=pizza&spellcheck=true&spellcheck.dictionary=external

          and I get the following stacktrace:

          SEVERE: java.lang.NullPointerException
          at org.apache.lucene.search.spell.SpellChecker.indexDictionary(SpellChecker.java:321)
          at org.apache.solr.handler.component.SpellCheckComponent$FieldSpellChecker.init(SpellCheckComponent.java:391)
          at org.apache.solr.handler.component.SpellCheckComponent.loadExternalFileDictionary(SpellCheckComponent.java:204)
          at org.apache.solr.handler.component.SpellCheckComponent.prepare(SpellCheckComponent.java:131)
          at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:133)
          at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:125)
          at org.apache.solr.core.SolrCore.execute(SolrCore.java:966)
          at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:339)
          at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:274)
          at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
          at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
          at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
          at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
          at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
          at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
          at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
          at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
          at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
          at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
          at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
          at java.lang.Thread.run(Thread.java:619)

          Any idea what I am doing wrong? Thanks!

          Otis Gospodnetic added a comment -

          Haven't looked at the code, but the first thing I'd try is using a full/absolute path to your dictionary file.

          Bojan Smid added a comment -

I already found the same problem, made a fix, and sent it to Shalin; he will incorporate it into the next patch when it's ready. If you specify the "fieldType" property for that dictionary (and that field type can be found in the Solr schema), you'll avoid the problem for now.

          Otis Gospodnetic added a comment -

Just got an idea. File-based dictionaries don't have word frequency information, so features that depend on it cannot be used (e.g. onlyMorePopular). What if we (also) accepted plain-text dictionaries that included word frequency information?
e.g.
ball,100
boil,44
bowl,77
...
I'm not looking at the sources now, but could we not feed this word frequency information into the Lucene SC, so it makes use of that when figuring out the top-N best words to suggest?

          And how would we figure out the frequency of each word to begin with? I imagine we can have a tool/class that, given a path to a dictionary file with words and a path to a Lucene/Solr index, looks up each dictionary word's frequency in the given index and outputs "<word>,<freq>" for each word. This class could live in Lucene SC, but could be used by SCRH when rebuilding the SC index for example.

          Does this sound useful and implementable?
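A minimal sketch of how the proposed "<word>,<freq>" format could be parsed and ranked by popularity. The FreqDictionary class and its methods are invented for illustration and appear in no patch on this issue:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FreqDictionary {

    // Parse lines of the proposed "<word>,<freq>" format; lines without a
    // frequency fall back to a default of 1.
    public static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> freqs = new LinkedHashMap<String, Integer>();
        for (String line : lines) {
            int comma = line.lastIndexOf(',');
            if (comma >= 0) {
                freqs.put(line.substring(0, comma).trim(),
                          Integer.parseInt(line.substring(comma + 1).trim()));
            } else {
                freqs.put(line.trim(), 1);
            }
        }
        return freqs;
    }

    // Return up to n words ordered by descending frequency, the way an
    // onlyMorePopular-style ranking could use the stored counts.
    public static List<String> topN(Map<String, Integer> freqs, int n) {
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<Map.Entry<String, Integer>>(freqs.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return b.getValue().compareTo(a.getValue());
            }
        });
        List<String> top = new ArrayList<String>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        List<String> lines = java.util.Arrays.asList("ball,100", "boil,44", "bowl,77");
        Map<String, Integer> freqs = parse(lines);
        System.out.println(topN(freqs, 2)); // [ball, bowl]
    }
}
```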

          Oleg Gnatovskiy added a comment - - edited

          Bojan, do you mean adding something like <str name="field">word</str> to the definition for the file-based dictionary?

          Bojan Smid added a comment -

Oleg, that field is now called fieldType, so something like <str name="fieldType">word</str> should work for you as long as you have a field type named word defined in your schema.xml. Let me know if this works.

          Bojan Smid added a comment -

I noticed that when searching for a suggestion for a word which exists in the dictionary, SC returns some similar word instead of returning that same word. The old SCRH had a field "exist" which returned true if the word exists in the dictionary (so the client can treat it as a correct word that doesn't need a suggestion).

We can't have exactly the same functionality here (since multi-word queries should be supported), but we can make SC return a field "spellingCorrect" in case all words from the query exist in the dictionary. Otherwise, there is no way to know if the spelling was correct or whether we should display a suggestion.

There is a method in Lucene's SC to check if a word exists in the index, so it's easy to check if a word is correct. However, I'm also thinking of the situation when we don't have just simple words in the query, for instance "toyata AND miles:[1 TO 10000]"; we want to check just toyata in the index, and return the suggestion "toyota AND miles:[1 TO 10000]". Other query types which might pose a problem are:

          • fuzzy query
          • wildcard query
          • prefix query
            ...
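To illustrate the difficulty above, a naive term extractor that keeps only the simple words of a query (skipping boolean operators, field-qualified clauses, ranges, and wildcard/fuzzy syntax) might look like the sketch below. The class is hypothetical and only approximates what a real query parser would do:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleTermExtractor {

    private static final Pattern TOKEN = Pattern.compile("\\S+");

    // Return only the plain words of a raw query string, so that just those
    // are handed to the spell checker. This is an illustration of the
    // problem, not code from any patch here.
    public static List<String> extract(String query) {
        List<String> terms = new ArrayList<String>();
        boolean inRange = false;
        Matcher m = TOKEN.matcher(query);
        while (m.find()) {
            String tok = m.group();
            // a range clause like miles:[1 TO 10000] spans several tokens
            if (tok.contains(":[") || tok.contains(":{")
                    || tok.startsWith("[") || tok.startsWith("{")) inRange = true;
            if (inRange) {
                if (tok.endsWith("]") || tok.endsWith("}")) inRange = false;
                continue;
            }
            if (tok.equals("AND") || tok.equals("OR") || tok.equals("NOT")) continue;
            if (tok.contains(":")) continue;                             // field-qualified clause
            if (tok.indexOf('*') >= 0 || tok.indexOf('?') >= 0) continue; // wildcard/prefix query
            if (tok.indexOf('~') >= 0) continue;                          // fuzzy query
            terms.add(tok);
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(extract("toyata AND miles:[1 TO 10000]")); // [toyata]
    }
}
```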
          Oleg Gnatovskiy added a comment -

          Yes, I've actually run into that problem too. Do you think this is something that you will be able to solve?

          Bojan Smid added a comment - - edited

          Sure. A quick fix can be done easily, but it probably wouldn't cover all possibilities, hence my post...

          Grant Ingersoll added a comment - - edited

          OK, I'm working on this.

          Some thoughts:
          1. Why is the initialization done in prepare? Just to be a little more lazy than in init?

          2. In FieldSpellChecker, the getSuggestion method goes through and creates the suggested map, but then the loop over the entry set at the end only uses the value. I think our response should return the associated correction with the original token.

          3. I'm working on the abstraction notion. The goal is to have a common response, no matter the spell checker, so that we can plug and play spell checkers. I hope to have a patch soon.

          Shalin Shekhar Mangar added a comment -

          Grant, please hold on a bit. I'm working on the patch too and it has some refactorings which may make merging two patches difficult. I'll post my patch in a few minutes and then you can take over.

          Grant Ingersoll added a comment - - edited

          OK. Kind of too late, but no worries, I will manage the merge, so just do what you think you need to do.

          Shalin Shekhar Mangar added a comment -

          Grant – please find my comments below:

          1. I had to move the init to prepare because there were issues in getting access to the IndexReader in inform() method. Please see http://www.nabble.com/Accessing-IndexReader-during-core-initialization-hangs-init-to17259235.html
2. The first getSuggestion method aims to return a single suggestion string by combining suggestions for all tokens in the query. It's not perfect but seems to work. This is used when spellcheck.count is missing or one. The second suggestSimilar method returns the suggestions associated with each individual token.
          3. That would be nice to have!
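For illustration, the single-consistent-suggestion behavior described in point 2 (preserve words that are already correct, substitute the best candidate for misspelled ones, never emit the same word twice) could be sketched roughly as follows. The class, method names, and data structures are hypothetical stand-ins, not the actual getSuggestion implementation:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SuggestionCombiner {

    // Combine per-token suggestions into one multi-word suggestion string:
    // tokens already in the dictionary are kept as-is; for misspelled tokens
    // the best-ranked candidate not yet used is substituted.
    public static String combine(List<String> queryTokens,
                                 Set<String> dictionary,
                                 Map<String, List<String>> candidates) {
        Set<String> used = new LinkedHashSet<String>();
        List<String> out = new ArrayList<String>();
        for (String token : queryTokens) {
            String chosen = token;
            if (!dictionary.contains(token)) {
                List<String> cands = candidates.get(token);
                if (cands != null) {
                    for (String c : cands) {          // best-ranked candidate first
                        if (!used.contains(c)) { chosen = c; break; }
                    }
                }
            }
            used.add(chosen);
            out.add(chosen);
        }
        StringBuilder sb = new StringBuilder();
        for (String w : out) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(w);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Set<String> dict = new LinkedHashSet<String>(java.util.Arrays.asList("toyota", "cars"));
        Map<String, List<String>> cands = new java.util.HashMap<String, List<String>>();
        cands.put("toyata", java.util.Arrays.asList("toyota"));
        System.out.println(combine(java.util.Arrays.asList("toyata", "cars"), dict, cands)); // toyota cars
    }
}
```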
          Shalin Shekhar Mangar added a comment -

          This patch contains the following changes:

          1. Fixes bug reported by Oleg – Thanks to Bojan for this.
          2. thresholdTokenFrequency can be used to tweak the frequency of tokens being passed to spell check index. This is applied only for index type dictionaries.
          3. Moved getLines as an overloaded method to SolrResourceLoader.
4. To avoid having a dependency on Lucene 2.4 (trunk) code, I created a wrapper class for PlainTextDictionary which calls its protected constructor PlainTextDictionary(Reader)
          5. Uses Lucene's SpellChecker's overloaded suggestSimilar method which accepts the IndexReader as a param. This makes sure that when the query is present in the index, a different suggestion is not returned.
          6. Implements the onlyMorePopular only for dictionaries built from Solr fields
          7. Implements the extendedResults only for dictionaries built from Solr fields and only when spellcheck.count is greater than 1
          8. No need to specify spellcheck.dictionary as a request parameter if only one dictionary is configured.
          9. Accuracy is configurable through solrconfig.xml

          Still to do:

1. It is possible to implement onlyMorePopular and extendedResults for dictionaries created from arbitrary Lucene indices too, but I haven't looked into that yet.
          2. Tests are missing
          3. Add command to reload dictionaries
          Shalin Shekhar Mangar added a comment -

          Otis – Sorry, I missed your post earlier. I can't think of a use-case for adding frequency information to plain text files. Spell checker's utility comes from the fact that it can suggest keywords for which Solr can return documents. That is possible only when the tokens (or synonyms) are present in the Solr index. Plain text dictionaries will be used to add additional common keywords which may not be in the Solr fields used for suggestions but may be present in huge fields which you don't want to add to spell checker. For example, I may build my index only on vehicle brands but I may like to include terms such as "cars", "manufacturer", "make" from plain text files, which may be present in my huge default search field. Since the intent would be just to match some document with the given suggestion, frequency may not play a significant role here, IMHO. What do you think?

          Bojan – I think we should include an "exists" flag in the response. As for your point of queries with non-simple tokens, we can introduce another param like "spellcheck.q" to which the application can set the simple query. End users almost never know that Solr is running behind the scenes and the Solr queries are constructed by the application itself which can send the simple query in this way.

          Otis Gospodnetic added a comment -

Shalin – I think you are right. I looked at SpellChecker again and see that the frequency in the main/searchable index is checked at "suggest time", regardless of the source of the dictionary words (index or file), so frequency will be accounted for even when words are loaded from plain-text dictionary files.

          Unless I'm still missing something, that means that "onlyMorePopular" can (or should!) be used even when words are loaded from plain-text dictionary files. No?

          Grant Ingersoll added a comment -

Is prepare thread-safe for dictionary creation? It seems like there is a race condition on the construction of the dictionaries. I suppose we need a synchronized block in there.

          Shalin Shekhar Mangar added a comment -

Grant – No, it is not thread-safe. Actually, I wanted to put this initialization code in an inform method to avoid this situation. Since that did not work, I moved it into the prepare method only as a stopgap arrangement. See http://www.nabble.com/Accessing-IndexReader-during-core-initialization-hangs-init-to17259235.html for details.

          I'd suggest doing the following:

• Move the initial dictionary creation into an inform method if someone with more knowledge about the SolrCore class can fix the issue I described in my mail.
          • The code in prepare can be used to reload dictionaries by specifying a request parameter (say spellcheck.rebuild=true)
          • Since we're already using a ConcurrentHashMap, the above two things should take care of all thread-safety issues.
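The race-free lazy-initialization pattern being discussed (ConcurrentHashMap plus one-time construction) is a standard one; a rough sketch follows, with a placeholder LazyDictionaryCache class that is not taken from any patch here:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.FutureTask;

public class LazyDictionaryCache {

    // ConcurrentHashMap.putIfAbsent plus a FutureTask ensures each dictionary
    // is built exactly once even when several request threads hit prepare()
    // concurrently; losing threads block until the winner's build finishes.
    private final ConcurrentMap<String, FutureTask<Object>> cache =
            new ConcurrentHashMap<String, FutureTask<Object>>();

    public Object getOrBuild(String name, Callable<Object> builder) throws Exception {
        FutureTask<Object> task = cache.get(name);
        if (task == null) {
            FutureTask<Object> created = new FutureTask<Object>(builder);
            task = cache.putIfAbsent(name, created);
            if (task == null) {   // we won the race; run the build exactly once
                task = created;
                created.run();
            }
        }
        return task.get();
    }

    public static void main(String[] args) throws Exception {
        LazyDictionaryCache cache = new LazyDictionaryCache();
        final int[] builds = {0};
        Callable<Object> builder = new Callable<Object>() {
            public Object call() { builds[0]++; return "dictionary-contents"; }
        };
        cache.getOrBuild("default", builder);
        cache.getOrBuild("default", builder);
        System.out.println(builds[0]); // 1 -- built only once
    }
}
```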
          Grant Ingersoll added a comment -

          Otis,

          What's the use case behind:

          Oh, I see, you are reading field values from the index of the current core. I think that is fine, but wouldn't it also be good to be able to read field values from a vanilla Lucene index?

          Seems kind of strange based on what I know of index-based spelling, but I don't know everything about it.

          Grant Ingersoll added a comment -

          WARNING: This patch compiles ONLY. I do NOT claim it is semantically equivalent to the earlier patches although that is my goal and I don't think I am far off. I have not tested it in any way, shape or form. I am only putting it up here as a first cut of the abstractions I have in mind, so please provide feedback based on that, especially in regards to the SolrSpellChecker class. Most interesting, there, is the passing in of the IndexReader. I know not all spellers are going to need the IndexReader, so ideally, it would be something that is passed in or set during the construction of the speller, but I don't think that will work, or at least I am not aware of how to make it work just yet.

          My next step is to add unit tests of the individual spell checkers and then the component itself.

          Grant Ingersoll added a comment -

Move the spelling core classes out of the component package into their own package, similar to highlighting, as spelling is just as important. Same caveats as the last patch apply.

          Grant Ingersoll added a comment -

          Also included in that last patch is a (proposed) sample configuration:

<searchComponent name="spellcheck" class="org.apache.solr.handler.component.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="classname">org.apache.solr.spelling.IndexBasedSpellChecker</str>
    <lst name="dictionary">
      <str name="name">default</str>
      <str name="field">word</str>
      <str name="indexDir">c:/temp/spellindex</str>
    </lst>
  </lst>
  <lst name="spellchecker">
    <str name="classname">org.apache.solr.spelling.FileBasedSpellChecker</str>
    <lst name="dictionary">
      <str name="name">external</str>
      <str name="sourceLocation">spellings.txt</str>
      <str name="characterEncoding">UTF-8</str>
      <str name="spellcheckIndexDir">./spellchecker</str>
    </lst>
  </lst>
</searchComponent>
          Otis Gospodnetic added a comment -

          I'm still confused with some of the names in that config.
indexDir looks like the path to the spellchecker index. But there is also spellcheckIndexDir. Is there a functional difference?

          Regarding the "wouldn't it also be good to be able to read field values from a vanilla Lucene index?" - the use case is that not all source indices should have to be Solr indices. What if I have a vanilla Lucene index on the machine and I want the SCRH to build a SC index from that index's "title" field? That is, I want the functionality of SCRH, but I don't have my Lucene index under Solr. Is that doable?

          Grant Ingersoll added a comment -

          indexDir looks like the path to the spellchecker index. But there is also spellcheckIndexDir. Is there a functional difference?

          Good point, I'll fix that.

          Is that doable?

          Of course it is, I just didn't know why you would want to. I get the file based need, b/c that is where you can put overrides, but I just don't get the need for another index, since wouldn't it have to have the same frequencies, etc. to return appropriate suggestions?

          Otis Gospodnetic added a comment -

          I think the choice of "appropriate suggestions" should be left to the user of this service. If it's easily doable, let's make it possible and put information about frequencies in an appropriate place.

          Otis Gospodnetic added a comment -

          Shalin/Grant:

          I think Bojan brings up some good questions:
          https://issues.apache.org/jira/browse/SOLR-572?focusedCommentId=12598752#action_12598752

          It looks like the call to SpellChecker.exist(...) really got lost:
          $ curl --silent https://issues.apache.org/jira/secure/attachment/12382691/SOLR-572.patch | grep 'exist('

          Grant Ingersoll added a comment -

          OK, this has some tests for the individual spell checkers. Still haven't tested starting it up as an individual component in Solr.

          Also, still needs a way to account for when the returned suggestion is the same word, thus indicating the word exists in the index.

          Grant Ingersoll added a comment -

          Good stuff, but it ain't "Major"

          Shalin Shekhar Mangar added a comment - edited

          I have a few comments after a quick look at the patch

          • Let's make SolrSpellChecker keep the standard init method structure akin to NamedListInitializedPlugin. Let the build method return the dictionary name. In the current patch, even if build fails, the spell checker would still get added to the map.
          • I couldn't find where the SolrSpellChecker#build method is actually called apart from the tests.
          • Let's remove the SolrSpellChecker#getSuggestion(String query, IndexReader reader, boolean onlyMorePopular) method completely. The other getSuggestion method will be called with count=1 if count is absent in the query.
          • Rename AbstractLuceneSpellerSpellChecker to something shorter.
          • I would very much like to keep short names instead of complete class names. We should not force the user to remember or copy-paste our long internal class names just because we wanted to keep things pluggable. Sane defaults maybe?
          • The configuration looks scary. There's no value added by the repeated spellchecker nodes. I propose the following syntax:
            <searchComponent name="spellcheck" class="org.apache.solr.handler.component.SpellCheckComponent">
            	<lst name="dictionary">
            	  <!-- Optional, it is required when more than one dictionary is configured -->
            	  <str name="name">default</str>
            	  <!-- The type is optional, defaults to IndexBasedSpellChecker -->
            	  <str name="type">org.apache.solr.spelling.IndexBasedSpellChecker</str>
            	  <!-- Optional, if present, the following lucene index is used as source instead of Solr index  -->
            	  <str name="sourceLocation">c:/temp/myluceneindex</str>
            	  <!--
            	       Load tokens from the following field for spell checking, 
            	       analyzer for the field's type as defined in schema.xml are used
            	  -->
            	  <str name="field">word</str>
            	  <!-- Optional, by default use in-memory index (RAMDirectory) -->
            	  <str name="spellCheckIndexDir">c:/temp/spellindex</str>
            	</lst>
            	<lst name="dictionary">
            	  <str name="name">external</str>
            	  <str name="type">org.apache.solr.spelling.FileBasedSpellChecker</str>
            	  <str name="sourceLocation">spellings.txt</str>
            	  <!--
            	       Optional, if provided the analyzers for the given fieldType would be used.
            	       Otherwise, no analyzer at index-time and WhiteSpaceAnalyzer at query time is used.
            	       This fieldType should be defined in schema.xml
            	   -->
            	  <str name="fieldType">text</str>
            	  <!-- Optional, defaults to platform encoding -->
            	  <str name="characterEncoding">UTF-8</str>
            	  <str name="spellcheckIndexDir">./spellchecker</str>
            	</lst>
              </searchComponent>
            
          • Last but not least, Grant, do you know of a freely available spell checker implementation that someone may want to plug in instead of the Lucene SpellChecker? In other words, is this a real use-case or something we're just imagining? If we don't know of something that can be used right now, maybe we're better off postponing this change until users really need it and ask for it. I don't like the complexity this feature is asking for.
          Grant Ingersoll added a comment -

          Let's make SolrSpellChecker keep the standard init method structure akin to NamedListInitializedPlugin. Let the build method return the dictionary name. In the current patch, even if build fails, the spell checker would get added to the map.

          The approach is to then use the build in the prepare method, much like the cmd=rebuild. Thus, spelling index creation is much like in the RequestHandler mode and gets around the firstSearcher issue. I am working on the integration into the Component at the moment, which is why you only see it in the tests.

          So, I am not sure if this makes sense. Right now, I have it so that we extract the necessary pieces in the init, but then they are applied during build. I guess the question is what should happen if "build" fails? Should we just remove that speller and log a warning? Or should it throw an exception? I am leaning towards the former.

          Rename AbstractLuceneSpellerSpellChecker to something shorter.

          OK, I will try to think of something.

          Let's remove the SolrSpellChecker#getSuggestion(String query, IndexReader reader, boolean onlyMorePopular) method completely. The other getSuggestion method will be called with count=1 if count is absent in the query.

          I was just thinking the same thing. Done.

          The configuration looks scary. There's no value added by the repeated spellchecker nodes. I propose the following syntax:

          All Solr configs look scary to me! However...

          I can imagine an implementation that looks like:

          <lst name="spellchecker">
           <str name="classname">my.great.SpellChecker</str>
          </lst>
          

          I agree, however, we can flatten mine one level, but keep the name spellchecker instead of dictionary.

          Here's an iteration:

          <searchComponent name="spellcheck" class="org.apache.solr.handler.component.SpellCheckComponent">
              <lst name="defaults">
                <!-- omp = Only More Popular -->
                <str name="sc.omp">false</str>
                <!-- exr = Extended Results -->
                <str name="sc.exr">false</str>
                <!--  The number of suggestions to return -->
                <str name="sc.cnt">1</str>
              </lst>
              <lst name="spellchecker">
                <str name="classname">org.apache.solr.spelling.IndexBasedSpellChecker</str>
                <str name="name">default</str>
                <str name="field">text</str>
                <str name="indexDir">c:/temp/spellindex</str>
                
              </lst>
              <lst name="spellchecker">
                <str name="classname">org.apache.solr.spelling.FileBasedSpellChecker</str>
                <str name="name">external</str>
                <str name="sourceLocation">spellings.txt</str>
                <str name="characterEncoding">UTF-8</str>
                <str name="indexDir">./spellchecker</str>
              </lst>
            </searchComponent>
          

          Last but not least, Grant, do you know of a freely available spell checker implementation that someone may want to plug in instead of the Lucene SpellChecker? In other words, is this a real use-case or something we're just imagining? If we don't know of something that can be used right now, maybe we're better off postponing this change until users really need it and ask for it. I don't like the complexity this feature is asking for.

          Yes, I have an immediate need for it. The Lucene SpellChecker isn't all that good, IMO, and I want to offer something different without having to fork and have my own SpellChecker Component when the output is the same.

          Grant Ingersoll added a comment -

          If this is a default component, how do you setup the field to be used for spelling? Are you just using the default search field?
          Also, I don't think we should make it default, since there is this minor nit that it requires building the index first. I suppose that could be done on the first time spellings are requested, but that seems like it could all of a sudden cause a much longer return. By making it non-default, I think it forces the person doing the configuration to think more about the setup, since the setup of proper spelling is not trivial.

          Otis Gospodnetic added a comment -

          Grant, which spellchecker are you plugging in?

          Grant Ingersoll added a comment -

          More tests, slight reworking of how response gets generated by using SpellingResult so that we can enforce a contract with whatever implementation of the SolrSpellChecker we have (NamedList is just too weakly typed to be effective for this.)

          Incorporated suggestions from Shalin on configuration and other pieces.

          TODO: more tests, Add easy to find "exists" functionality when the suggestion is the same as the token.

          Getting closer to something to commit.

          Shalin Shekhar Mangar added a comment -

          The configuration looks fine, Grant. Yes, we don't need this as a default. Default search fields are usually large, and we don't need that overhead by default. The user can always enable and configure this when he needs it. Maybe we should add the sample configuration as a commented section in solrconfig.xml.

          • We should change the query parameters to long names so that their purpose is easily understood. Names like "sc.omp" and "sc.exr" seem cryptic.
          • We don't need the rebuild command since build and rebuild both do the same thing.
          • Add an optional spellcheck.q request parameter for passing in simple queries (to avoid the problem that Bojan pointed out)

          I'll give a patch shortly after making the above changes. Will also try to look into adding the exists feature.
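
          Such a commented-out section might look like this (a sketch only; the element names follow the configuration proposed earlier in this thread, and the field name is a placeholder):

          ```xml
          <!-- Spell checking: uncomment and adjust to enable. A build request must be
               issued before the first spellcheck query.
          <searchComponent name="spellcheck" class="org.apache.solr.handler.component.SpellCheckComponent">
            <lst name="spellchecker">
              <str name="classname">org.apache.solr.spelling.IndexBasedSpellChecker</str>
              <str name="name">default</str>
              <str name="field">text</str>
              <str name="indexDir">./spellchecker</str>
            </lst>
          </searchComponent>
          -->
          ```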

          Bojan Smid added a comment - edited

          Shalin, I'm not sure we really need the spellcheck.q parameter. I think we should handle all queries in a similar way (both complex and simple queries):

          • Break each query into terms, and then for each term check if it was correctly spelled (with spellchecker.exist()). Some term types should be excluded from spell checking (range terms and other types I mentioned in the post above).
          • If all terms (which can be spell checked) in a query are correctly spelled, we put a flag correctlySpelled = true in the response; otherwise we set the flag to false and return a suggestion (we change only terms for which spellchecker.exist() returned false).

          What do you think of that?

          Shalin Shekhar Mangar added a comment -

          Break each query into terms, and then for each term check if it was correctly spelled (with spellchecker.exist()). Some term types should be excluded from spell checking (range terms and other types I mentioned in the post above).

          We should not try to do intelligent things which the user can easily do. It's difficult to extract terms which represent range terms, wildcards, fuzzy queries and boolean operators. We will need a parser to identify and remove these things correctly from the query which is not something we should be doing. Since the user always builds the q parameter, he can also build the spellcheck.q parameter if he chooses to do so.
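
          To illustrate, a client deriving spellcheck.q from its own q could look like the sketch below (client-side pseudocode made concrete in Python; the exclusion rules shown are assumptions, not part of any patch):

          ```python
          import re

          def to_spellcheck_q(q: str) -> str:
              """Rough client-side sketch: keep only plain words from a raw query,
              dropping ranges, wildcards, fuzzy terms and boolean operators."""
              q = re.sub(r"\[[^\]]*\]|\{[^}]*\}", " ", q)   # drop range expressions whole
              q = re.sub(r'["()^]', " ", q)                 # drop phrase/group/boost syntax
              words = []
              for tok in q.split():
                  if tok in {"AND", "OR", "NOT"}:           # boolean operators
                      continue
                  if ":" in tok:                            # field-qualified term: keep term part
                      tok = tok.split(":", 1)[1]
                  if any(c in tok for c in "*?~"):          # wildcard or fuzzy term
                      continue
                  tok = tok.strip("+-")
                  if tok:
                      words.append(tok)
              return " ".join(words)

          print(to_spellcheck_q("title:londn AND brige date:[2007 TO 2008] wild* fuzzi~"))
          # -> londn brige
          ```

          Since only the client knows which parts of q are meaningful for spell checking, rules like these belong in the application rather than in Solr.
          
          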

          If all terms (which can be spell checked) in a query are correctly spelled, we put a flag correctlySpelled = true in the response; otherwise we set the flag to false and return a suggestion (we change only terms for which spellchecker.exist() returned false).

          Agreed

          Grant Ingersoll added a comment - edited

          Yeah, I was torn on this one, and am fine either way. Most of Solr's
          existing params are quite short and cryptic. I guess it is trying to
          prevent the GET buffer length problem (does that still exist?), but I
          don't know.

          I was just trying to keep some common ground w/ the ReqHandler
          version, which uses rebuild, but I agree build is shorter and you
          can't rebuild something until you build it, right?

          Not sure I follow here, but I'll wait for your patch.

          Otis Gospodnetic added a comment -

          Yes, the GET length limit is still with us, but it's 2K+ chars. Here is info about IE7 and friends, for example: http://support.microsoft.com/kb/208427
          So I think we still have a bit of room there.

          Shalin Shekhar Mangar added a comment -

          Changes:

          1. Changed request parameters to use long names
          2. Removed the command syntax. Params are spellcheck.build=true or spellcheck.reload=true
          3. Uses spellcheck.q if present, otherwise q parameter
          4. Return correctlySpelled=true in result if all tokens in the input query are present in the index. Added a simple test for this change.
          5. Renamed indexDir in config to spellcheckIndexDir for clarity
          6. Updated SpellCheckComponentTest and test solrconfig.xml for the above changes
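
          A request exercising the renamed parameters might be built like this (host, core and query values are hypothetical; the parameter names follow the changes listed above):

          ```python
          from urllib.parse import urlencode

          # Hypothetical example request against a local Solr instance.
          params = {
              "q": "pizzza",
              "spellcheck": "true",
              "spellcheck.q": "pizzza",        # optional simple query (change 3)
              "spellcheck.build": "true",      # replaces the old cmd syntax (change 2)
              "spellcheck.count": 1,
              "spellcheck.onlyMorePopular": "false",
          }
          url = "http://localhost:8983/solr/select?" + urlencode(params)
          print(url)
          ```
          
          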
          Grant Ingersoll added a comment -

          Return correctlySpelled=true in result if all tokens in the input query are present in the index. Added a simple test for this change.

          If I'm reading the code right, this can only be set if extendedResults is true, since that is the only time there will be frequency information; yet I believe that with the upgrade to the latest Lucene spell checker, it now returns the original word as a suggestion if it is correctly spelled.

          Rest of changes look good.

          Grant Ingersoll added a comment -

          Some more tests for various edge cases on the Index based Speller.

          Oleg Gnatovskiy added a comment -

          Did you guys change the required URL parameter structure? I am hitting the following URL: http://localhost:8983/solr/select/?q=pizza&spellcheck=true&spellcheck.dictionary=default and I am getting a NullPointerException. The config is the one from the sample, and I am using the latest patch.

          Grant Ingersoll added a comment -

          Did you issue a build command first? Note, I haven't yet fully
          tested in the Solr container, have been more focused on individual
          unit tests.

          Also, what's the NPE you are getting?

          Otis Gospodnetic added a comment -

          I haven't applied/tried the latest patch yet, but maybe it's
          quicker/better to ask here. I'm wondering/worried about the case
          where the input is a multi-term query string and a subset (e.g. 2 of 5
          terms) of the query terms is misspelled.

          For example, what happens when the query is:

          "london brigge is fallinge down"
          (my 2 year old's current hit)

          In this case the suggestions should be:

          1. brigge => bridge
          2. fallinge => falling (or fall, more likely)

          Is there something in the response that will allow the client to
          figure out the positioning of the spelling suggestions and piece
          together the ideal alternative query, in this case "london bridge is
          falling/fall down"?

          Ideally, the client could piece together the new query string, so that it can, for example, italicize the misspelled words (see Google's DYM). If the current SCRH returns the final corrected string, e.g. "london bridge is falling down", the client has no easy/accurate way of figuring out what was changed, I think. If the SCRH returned some mark-up that told the client which word(s) changed, then the client could do something with those changed words, e.g. "london bridge {was:brigge} ...."

          Or, if that has problems, maybe each word should be returned separately and sequentially:

          <word="london"/> <!-- unchanged -->
          <word="brigge">bridge</word>

          or maybe with offset info:

          <word="london" offset="0"/> <!-- unchanged -->
          <word="brigge" offset="6">bridge</word>

          Thoughts?

          Oleg Gnatovskiy added a comment -

          Hello. I am hitting http://localhost:8983/solr/select/?q=pizza&spellcheck=true&spellcheck.dictionary=default&spellcheck.build=true when trying to build the dictionary. My config looks like this:
          <searchComponent name="spellcheck" class="org.apache.solr.handler.component.SpellCheckComponent">
          <lst name="defaults">
          <!-- omp = Only More Popular -->
          <str name="spellcheck.onlyMorePopular">false</str>
          <!-- exr = Extended Results -->
          <str name="spellcheck.extendedResults">false</str>
          <!-- The number of suggestions to return -->
          <str name="spellcheck.count">1</str>
          </lst>
          <lst name="spellchecker">
          <str name="classname">org.apache.solr.spelling.IndexBasedSpellChecker</str>
          <str name="name">default</str>
          <str name="fieldType">text_ws</str>
          <str name="indexDir">/usr/local/apache/lucene/solr1home/solr/data/spellchecker</str>

          </lst>
          <lst name="spellchecker">
          <str name="classname">org.apache.solr.spelling.FileBasedSpellChecker</str>
          <str name="name">external</str>
          <str name="sourceLocation">spellings.txt</str>
          <str name="fieldType">text_ws</str>
          <str name="characterEncoding">UTF-8</str>
          <str name="indexDir">/usr/local/apache/lucene/solr1home/solr/data/spellchecker</str>
          </lst>
          </searchComponent>

          And the NPE is:

          SEVERE: java.lang.NullPointerException
          at org.apache.solr.util.HighFrequencyDictionary.<init>(HighFrequencyDictionary.java:48)
          at org.apache.solr.spelling.IndexBasedSpellChecker.loadLuceneDictionary(IndexBasedSpellChecker.java:103)
          at org.apache.solr.spelling.IndexBasedSpellChecker.build(IndexBasedSpellChecker.java:84)
          at org.apache.solr.handler.component.SpellCheckComponent.prepare(SpellCheckComponent.java:133)
          at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:132)
          at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:125)
          at org.apache.solr.core.SolrCore.execute(SolrCore.java:965)
          at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:339)
          at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:274)
          at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
          at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
          at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
          at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
          at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
          at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
          at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
          at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
          at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
          at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
          at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
          at java.lang.Thread.run(Thread.java:619)

          Grant Ingersoll added a comment - - edited

          I'm working on it. Will have a new patch soon.

          Oleg Gnatovskiy added a comment -

          Is it an actual error, or was I missing something?

          Oleg Gnatovskiy added a comment - - edited

          In response to Otis, I don't think each word should be returned individually. In fact it should probably return the entire phrase, with the suggestions inserted. I believe that is what google does. Although I guess if the words are returned sequentially, you can easily reform the phrase, so that works too...

          Grant Ingersoll added a comment -

          All you see from Googs is their frontend, so who knows what their
          spell checker does. I think we should return the words individually,
          the application is responsible for doing the sewing together of the
          new string, IMO.

          Oleg Gnatovskiy added a comment -

          Should we return suggestions only for the misspelled words, or should we echo the correctly spelled ones as well?

          Otis Gospodnetic added a comment -

          Right, Google only shows you the final output, not what they do in the backend.
          But the fact that they italicize misspelled words tells us they have a mechanism that allows the front end to identify them.
          So I think our task here is to figure out the best/easiest way for the client to identify misspelled words and offer the alternative query to the end user.

          I think what I outlined above will do that for us:

          • output all words sequentially
          • mark the words that are misspelled - it may be best to return the original word plus corrected word:

          <word="london"/> <!-- unchanged -->
          <word="brigge">bridge</word>

          or maybe with offset info:

          <word="london" offset="0"/> <!-- unchanged -->
          <word="brigge" offset="6">bridge</word>

          It's also fine to additionally return the final corrected string that doesn't mark the corrected words in any way, and let the "lazy" clients just use that.

          Grant or Shalin, will either of you be adding this?
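
          Otis's offset-based format maps naturally onto a small client-side routine. The sketch below is illustrative Python, not part of any patch; the (word, offset, correction) tuple shape is an assumption standing in for the proposed <word ... offset="..."> markup. It splices corrections into the original query and italicizes the changed words:

```python
def apply_suggestions(query, suggestions):
    """Rebuild the corrected query from per-word suggestions with offsets.

    suggestions: list of (original_word, offset, correction_or_None),
    a hypothetical client-side view of the proposed markup. Splicing is
    done right to left so earlier offsets stay valid after each edit.
    """
    out = query
    for word, offset, correction in sorted(suggestions, key=lambda s: -s[1]):
        if correction is None:
            continue  # word was spelled correctly; leave it untouched
        out = out[:offset] + "<i>" + correction + "</i>" + out[offset + len(word):]
    return out

query = "london brigge is fallinge down"
suggestions = [("london", 0, None),
               ("brigge", 7, "bridge"),
               ("fallinge", 17, "falling")]
print(apply_suggestions(query, suggestions))
# -> london <i>bridge</i> is <i>falling</i> down
```

          This is exactly the kind of "sewing together" Grant mentions leaving to the application: the server only needs to report each word, its offset, and its correction (if any).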

          Grant Ingersoll added a comment -

          Grant or Shalin, will either of you be adding this?

          Yes, I am working on it.

          Oleg Gnatovskiy added a comment -

          I am still confused about my NPE. Was that a config issue on my part, or was it a bug? The way Grant said he was working on it, I assumed that it was a bug

          Grant Ingersoll added a comment -

          Your "field" is null for your Lucene configuration. You need to
          specify:

          <str name="field">fieldName</str>

          You have fieldType instead.

          -Grant
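
          Applying Grant's fix, a minimal IndexBasedSpellChecker entry would look something like the following (the field name "name" is only a placeholder; use whichever indexed field the dictionary should be built from):

```xml
<lst name="spellchecker">
  <str name="classname">org.apache.solr.spelling.IndexBasedSpellChecker</str>
  <str name="name">default</str>
  <!-- "field" (not "fieldType") names the indexed field the dictionary is built from -->
  <str name="field">name</str>
  <str name="indexDir">/usr/local/apache/lucene/solr1home/solr/data/spellchecker</str>
</lst>
```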

          Grant Ingersoll added a comment -

          OK, here's a start on the token stuff.

          NOTE: This currently does not work!!!!!!!! The tests do not pass and I haven't fully implemented the SpellingQueryConverter. I have a few other things to attend to for a couple of days, so I wanted to get this up there as a starting point for others to look at and give comments on the approach for when I can get back to it in a day or two (but feel free to take it up, too).

          The basic gist of it is to hand off analysis to a pluggable piece called the SpellingQueryConverter, which produces a collection of Tokens (which contain offsets into the original query String).

          I'm still torn on how to best achieve this. In some sense, there has to be some interaction with some form of a Query Parser. I think it needs to be a Query Parser that has the source field's Analyzer as the Analyzer for doing the parsing. This way, the output Query is properly analyzed and we can then extract just those "spellcheckable" terms from it (i.e. TermQuery, PhraseQuery, ????)

          Does this make sense?

          Grant Ingersoll added a comment -

          OK, I changed the SpellingQueryConverter to not be dependent on the Query, instead opting for a simple regex approach. It is by no means perfect, but I think it is an improvement. All the tests now pass. See the test solrconfig for how to configure.

          This time, I mean it, I won't be working on this for a couple of days more or less, depending on other tasks
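
          The regex approach can be illustrated in a few lines. This is a Python stand-in, not the actual SpellingQueryConverter (which is Java), and the pattern is an assumption for illustration: extract bare word tokens with their offsets, skipping field-qualified terms and boolean operators.

```python
import re

# Illustrative "query converter" for spell checking: grab bare words and
# their offsets, ignoring field-qualified terms like "title:foo" and the
# AND/OR/NOT operators. The pattern is a stand-in, not Solr's actual regex.
TOKEN = re.compile(r"(?<![\w:])(?!\bAND\b|\bOR\b|\bNOT\b)([A-Za-z]+)\b(?!:)")

def convert(query):
    """Return (token, offset) pairs for the spellcheckable words."""
    return [(m.group(1), m.start(1)) for m in TOKEN.finditer(query)]

print(convert("title:solr brigge AND fallinge"))
# -> [('brigge', 11), ('fallinge', 22)]
```

          Keeping the offsets around is what lets the response tie each suggestion back to its position in the original query string, per the format discussed above.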

          Shalin Shekhar Mangar added a comment -

          Grant, unless I'm mistaken, the reason to add spellcheck.q parameter was to avoid the tedious query parsing logic that may be needed to extract "spellcheckable" terms from the q parameter. Do we really need to do this? All the extra things in the q parameter are usually added by the frontend itself, isn't it?

          Grant Ingersoll added a comment -

          Grant, unless I'm mistaken, the reason to add spellcheck.q parameter was to avoid the tedious query parsing logic that may be needed to extract "spellcheckable" terms from the q parameter. Do we really need to do this? All the extra things in the q parameter are usually added by the frontend itself, isn't it?

          Is that practical? How would an application even know how to generate spellcheck.q without parsing, etc.? I think the component should just work on the input query. I guess I hadn't really thought about the need for spellcheck.q before, but now that you put it in that light, I am not sure I see the need for it.

          I don't think all the extra things are necessarily added by the application. Users can input range queries, etc. The point is, it all depends on the application.

          At any rate, it is trivial to override the SpellingQueryConverter to not do the original REGEX and just apply the analyzer to produce the tokens. I suppose, we could offer two converters, one w/ the regex, and one without, or it could just have a flag.

          Oleg Gnatovskiy added a comment -

          I still have some issues. Here is my config:
          <lst name="spellchecker">
          <str name="classname">org.apache.solr.spelling.FileBasedSpellChecker</str>
          <str name="name">external</str>
          <str name="sourceLocation">/usr/local/apache/lucene/solr1home/conf/spellings.txt</str>
          <str name="field">word</str>
          <str name="characterEncoding">UTF-8</str>
          <!-- <str name="indexDir">/usr/local/apache/lucene/solr1home/solr/data/spellchecker</str> -->
          </lst>
          But why do I need a field for a file-based dictionary? Also, is this the correct URL to call: http://wil1devsch1.cs.tmcs:8983/solr/select/?q=pizza&spellcheck=true&spellcheck.dictionary=external&spellcheck.builld=true ?

          Shalin Shekhar Mangar added a comment -

          Oleg – You shouldn't need "field" for a file-based dictionary. "fieldType" is optional for a file-based dictionary. "field" is necessary only when you're using an IndexBasedSpellChecker. If you're running into a problem, it's a bug. Except for the double L in spellcheck.build in your URL, everything else looks OK.

          Oleg Gnatovskiy added a comment - - edited

          Here is what I am getting (using yesterday's patch):

          HTTP Status 500 - null java.lang.NullPointerException at org.apache.lucene.index.Term.<init>(Term.java:39) at org.apache.lucene.index.Term.<init>(Term.java:36) at org.apache.solr.spelling.AbstractLuceneSpellChecker.getSuggestions(AbstractLuceneSpellChecker.java:67) at org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:160) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:153) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:125) at org.apache.solr.core.SolrCore.execute(SolrCore.java:965) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:339) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:274) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447) at java.lang.Thread.run(Thread.java:619)

          Noble Paul added a comment -

          We must consider committing a basic version of the spellchecker without the intelligent query parsing etc. Most users' needs will be met. Adding enhancements later is not a bad idea (as long as we are not breaking backward compatibility).

          Shalin Shekhar Mangar added a comment -

          Oleg, please try this patch. There was a bug in the previous patch which tried to use "field" for suggestions even when it was null. That is why it gave a NullPointerException with FileBasedSpellChecker

          Shalin Shekhar Mangar added a comment -

          I had missed the src/test/test-files/spellings.txt in the previous patch so tests were failing. This patch adds it back.

          Oleg Gnatovskiy added a comment - - edited

          I installed the latest patch. Still getting a NPE. Here is my config:

          <searchComponent name="spellcheck" class="org.apache.solr.handler.component.SpellCheckComponent">
          <lst name="defaults">
          <!-- omp = Only More Popular -->
          <str name="spellcheck.onlyMorePopular">false</str>
          <!-- exr = Extended Results -->
          <str name="spellcheck.extendedResults">false</str>
          <!-- The number of suggestions to return -->
          <str name="spellcheck.count">1</str>
          </lst>

          <lst name="spellchecker">
          <str name="classname">org.apache.solr.spelling.FileBasedSpellChecker</str>
          <str name="name">external</str>
          <str name="sourceLocation">spellings.txt</str>
          <str name="characterEncoding">UTF-8</str>
          <str name="fieldType">text_ws</str>
          <str name="indexDir">/usr/local/apache/lucene/solr2home/solr/data/spellIndex</str>
          </lst>
          </searchComponent>

          Here is the URL I am hitting: http://localhost:8983/solr/select/?q=pizza&spellcheck=true&spellcheck.dictionary=external&spellcheck.build=true

          Here is the error:

          HTTP Status 500 - null java.lang.NullPointerException at org.apache.lucene.index.Term.<init>(Term.java:39) at org.apache.lucene.index.Term.<init>(Term.java:36) at org.apache.lucene.search.spell.SpellChecker.suggestSimilar(SpellChecker.java:228) at org.apache.solr.spelling.AbstractLuceneSpellChecker.getSuggestions(AbstractLuceneSpellChecker.java:71) at org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:177) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:153) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:125) at org.apache.solr.core.SolrCore.execute(SolrCore.java:965) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:339) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:274) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447) at java.lang.Thread.run(Thread.java:619)

          spellings.txt is in my solr/home/conf.

          Grant Ingersoll added a comment -

          One thing I haven't quite settled in my mind is the use of the file-based spell checker. It seems to me that the use case for this is as an override where one feels the index-based spelling is not correct. Is that right? Or am I missing something?

          If that is the case, shouldn't we at least allow the option of it truly acting as an override? Currently, the only way to get at it is by passing the dictionary name as the param. The only way I can see this as useful is if you are making several round trips to the server, which means you might as well be using a request handler and not a search component.

          Thoughts?

          Bojan Smid added a comment -

          File based spell checker would probably be used in cases when Solr index is too small or too young. So a user would compile a dictionary file (for instance, UNIX words file) and use it as a dictionary.

          Grant Ingersoll added a comment -

          File based spell checker would probably be used in cases when Solr index is too small or too young. So a user would compile a dictionary file (for instance, UNIX words file) and use it as a dictionary.

          But how is it useful to return results that aren't in the index? It's not like querying on them results in anything useful. Seems to me that in this case you just need to rebuild your dictionary on a regular basis. Or is it that people are using Solr as a spelling server?

          Now, I can see it as an override situation, i.e. one wishes to override certain results from the index-based one with ones that are known to be in the dictionary, but are lower down.

          Grant Ingersoll added a comment -

          Oleg,

          Can you try specifying a field value anyway for your bug up above? I think this is actually a bug in the Lucene Spell checker. Namely, the docs say that the field value can be null, but, it is trying to construct a Term, which requires a non-null field name.

          Just give it the name "word", perhaps

          Otis Gospodnetic added a comment -

          Grant, I think it's better to think of people using Solr+SCRH as a (generic) spellchecker service, not necessarily something that absolutely has to tie to a specific index and thus make only suggestions that result in hits.

          Another use case is where Solr is used with indices that are not indices for a narrow domain or that don't have nice, clean, short fields that can be used for populating the SC index. For example, if the index consists of a pile of web pages, I don't think I'd want to use their data (not even their titles) to populate the SC index. I'd really want just a plain dictionary-powered SCRH.

          Shalin Shekhar Mangar added a comment -

          Grant – The exception is happening because the SpellCheckComponent always passes Solr's own IndexReader when calling the AbstractLuceneSpellChecker#getSuggestions method even when the underlying spell checker is a FileBasedSpellChecker. In that case, since a non-null IndexReader is passed onto Lucene, it tries to create a term on the null field name. That is when the NullPointerException comes up.

          Another problem will occur when using IndexBasedSpellChecker with an arbitrary Lucene index, because then too, Solr's IndexReader would be passed to the Lucene SpellChecker instead of the actual index's reader.

          I think a possible solution can be to add another abstract method with the same signature as Lucene's SpellChecker to the AbstractLuceneSpellChecker and let each sub-class get suggestions on its own. That way FileBasedSpellChecker will pass the correct IndexReader or a null IndexReader into Lucene appropriately. The AbstractLuceneSpellChecker#getSuggestions will just call the underlying suggest method, get the String[] back and process as it does right now.
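
          The delegation pattern proposed above might look roughly like the following sketch. The types here (Reader, the checker classes) are simplified stand-ins for Lucene's IndexReader and the actual Solr spell checker classes, not the real code:

```java
public class ReaderDelegation {
    /** Hypothetical stand-in for Lucene's IndexReader. */
    interface Reader {}

    /** Base class: getSuggestions delegates reader selection to the subclass. */
    static abstract class AbstractChecker {
        // Each subclass decides which reader (possibly null) to hand downstream.
        protected abstract String[] suggestSimilar(String token, Reader solrReader);

        public String[] getSuggestions(String token, Reader solrReader) {
            return suggestSimilar(token, solrReader);
        }
    }

    /** A file-based checker ignores Solr's own reader and passes null on. */
    static class FileBasedChecker extends AbstractChecker {
        @Override
        protected String[] suggestSimilar(String token, Reader solrReader) {
            Reader effective = null; // the dictionary has no Solr index reader
            return new String[] { token + " (reader=" + effective + ")" };
        }
    }

    public static void main(String[] args) {
        // The caller always passes Solr's reader; the subclass discards it.
        System.out.println(
            new ReaderDelegation.FileBasedChecker()
                .getSuggestions("pizzza", new Reader() {})[0]);
        // pizzza (reader=null)
    }
}
```

          The point of the sketch is only the shape: the base class keeps its existing post-processing, while reader choice moves into each subclass.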

          Grant Ingersoll added a comment -

          OK, here's another crack at it. I think I fixed the field issue Oleg was seeing (but haven't fully tested that) and I have it up and running in the Solr example. After indexing the example docs there, try something like:

          http://localhost:8983/solr/spellCheckCompRH/?q=iPoo+text:sola&version=2.2&start=0&rows=10&indent=on&spellcheck=true&spellcheck.build=true
          

          to build it and spell check the query.

          I also have what I think is a good compromise on spell checking the CommonParams.Q and the Spellcheck.Q; namely, the latter just uses a whitespace tokenizer to create the tokens.

          I am also thinking of adding a "collate" functionality, which would take the top suggestions and splice them back into the original string, as this seems like something many apps would like to have.

          Otis Gospodnetic added a comment -

          By "collate" you mean that the SCRH would not only return suggestions/corrections for individual tokens, but would also try to glue together an already-corrected query string based on its suggestions?

          Example:
          Query: cogito erga sum

          SCRH returns this correction:
          erga -> ergo

          But also tries to give you the whole thing corrected:
          cogito ergo sum

          That? Sounds useful - less work for the client app, should the app developers decide that SCRH's collated suggestions are what they would have to do themselves anyway.
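
          The splicing described above can be sketched in plain Java using the startOffset/endOffset values the component already returns per token. This is a hypothetical illustration of the idea, not the patch's actual collation code:

```java
import java.util.ArrayList;
import java.util.List;

public class CollateSketch {
    /** One correction: [start, end) offsets into the original query string. */
    record Correction(int start, int end, String replacement) {}

    /** Splice corrections back into the original query. Replacements are
     *  applied right-to-left so earlier offsets stay valid throughout. */
    static String collate(String original, List<Correction> corrections) {
        StringBuilder sb = new StringBuilder(original);
        List<Correction> sorted = new ArrayList<>(corrections);
        sorted.sort((a, b) -> Integer.compare(b.start(), a.start()));
        for (Correction c : sorted) {
            sb.replace(c.start(), c.end(), c.replacement());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "erga" occupies offsets [7, 11) in the query below
        String collated = collate("cogito erga sum",
                List.of(new Correction(7, 11, "ergo")));
        System.out.println(collated); // cogito ergo sum
    }
}
```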

          Oleg Gnatovskiy added a comment -

          Hey guys. Installed the latest patch. Old problem is still there. For example if I do q=pizzzzza I get:

          <lst name="spellcheck">

          <lst name="suggestions">

          <lst name="pizzza">
          <int name="numFound">1</int>
          <int name="startOffset">0</int>
          <int name="endOffset">6</int>

          <arr name="suggestion">
          <str>pizza</str>
          </arr>
          </lst>
          </lst>
          </lst>

          Which is good. Then I do q=pizza (pizza is in the dictionary)

          <lst name="spellcheck">

          <lst name="suggestions">

          <lst name="pizza">
          <int name="numFound">1</int>
          <int name="startOffset">0</int>
          <int name="endOffset">5</int>

          <arr name="suggestion">
          <str>plaza</str>
          </arr>
          </lst>
          </lst>
          </lst>

          I don't think it should give me that suggestion. If a word is in the dictionary it should not give any suggestions.

          Grant Ingersoll added a comment -

          Grant - The exception is happening because the SpellCheckComponent always passes Solr's own IndexReader when calling the AbstractLuceneSpellChecker#getSuggestions method even when the underlying spell checker is a FileBasedSpellChecker. In that case, since a non-null IndexReader is passed onto Lucene, it tries to create a term on the null field name. That is when the NullPointerException comes up.

          Yep, I think I fixed this piece. See also LUCENE-1299

          I think a possible solution can be to add another abstract method with the same signature as Lucene's SpellChecker to the AbstractLuceneSpellChecker and let each sub-class get suggestions on it's own. That way FileBasedSpellChecker will pass the correct IndexReader or a null IndexReader into Lucene appropriately. The AbstractLuceneSpellChecker#getSuggestion will just call the underlying suggest method, get the String[] back and process as it does right now.

          Not sure I follow the solution (I understand the problem) Which signature are you suggesting?

          Grant Ingersoll added a comment -

          OK, I think this one is pretty good. I added a test for the alternate location piece. I think I also fixed the issues w/ the wrong IndexReader being passed around.

          I didn't implement the collate thing yet, but I think that can be handled as a separate patch.

          Mike Klaas added a comment -

          Another use case is where Solr is used with indices that are not indices for a narrow domain or that don't have nice, clean, short fields that can be used for populating the SC index. For example, if the index consists of a pile of web pages, I don't think I'd want to use their data (not even their titles) to populate the SC index. I'd really want just a plain dictionary-powered SCRH.

          It works great, actually. That way you get all the abbreviations, jargon, proper names, etc. Thresholding helps prevent most of the cruft from appearing in the index.

          Swarag Segu added a comment -

          Hey Guys,
          I installed the latest patch and it gives me compile errors :

          compile:
          [mkdir] Created dir: C:\Documents and Settings\Swarag Segu\workspace\solrSrc\build\core
          [javac] Compiling 324 source files to C:\Documents and Settings\Swarag Segu\workspace\solrSrc\build\core
          [javac] C:\Documents and Settings\Swarag Segu\workspace\solrSrc\src\java\org\apache\solr\spelling\FileBasedSpellChecker.java:97: cannot find symbol
          [javac] symbol : variable MaxFieldLength
          [javac] location: class org.apache.lucene.index.IndexWriter
          [javac] true, IndexWriter.MaxFieldLength.UNLIMITED);
          [javac] ^
          [javac] C:\Documents and Settings\Swarag Segu\workspace\solrSrc\src\java\org\apache\solr\spelling\FileBasedSpellChecker.java:96: internal error; cannot instantiate org.apache.lucene.index.IndexWriter.<init> at org.apache.lucene.index.IndexWriter to ()
          [javac] IndexWriter writer = new IndexWriter(ramDir, fieldType.getAnalyzer(),
          [javac] ^
          [javac] Note: Some input files use or override a deprecated API.
          [javac] Note: Recompile with -Xlint:deprecation for details.
          [javac] Note: Some input files use unchecked or unsafe operations.
          [javac] Note: Recompile with -Xlint:unchecked for details.
          [javac] 2 errors

          Am I missing something?
          Thanks,
          Swarag.

          Grant Ingersoll added a comment -

          What version of Lucene do you have in your lib directory? Try svn up
          from the root of Solr trunk.

          Grant Ingersoll added a comment -

          Small mod to move name and name init up to the SolrSpellChecker abstract class, since name is common to all spellers.

          Grant Ingersoll added a comment -

          Removes "unmodifiableMap" factor from the suggestions/token freqs. Rethinking this, I think it is reasonable to think that someone would want to modify these (or insert directly)

          Oleg Gnatovskiy added a comment -

          Do these latest patches require Lucene 2.4? Would it be better to stay with 2.3.1?

          Grant Ingersoll added a comment -

          Do these latest patches require Lucene 2.4? Would it be better to stay with 2.3.1?

          They require what is checked into Solr's lib directory, which is Lucene's trunk as of yesterday. There are actually a few changes in Lucene's spell checker that I think are worth having in 2.4. Additionally, I think we will want LUCENE-1297 before we are through, which is probably another configuration item. However, that can be added later, unless Otis commits it fairly soon.

          Swarag Segu added a comment -

          Hey guys. Installed the latest patch. Old problem is still there. For example if I do q=pizzzzza I get:

          <lst name="spellcheck">

          <lst name="suggestions">

          <lst name="pizzza">
          <int name="numFound">1</int>
          <int name="startOffset">0</int>
          <int name="endOffset">6</int>

          <arr name="suggestion">
          <str>pizza</str>
          </arr>
          </lst>
          </lst>
          </lst>

          Which is good. Then I do q=golf (golf is in the dictionary)

          <lst name="spellcheck">

          <lst name="suggestions">

          <lst name="golf">
          <int name="numFound">1</int>
          <int name="startOffset">0</int>
          <int name="endOffset">4</int>

          <arr name="suggestion">
          <str>roof</str>
          </arr>
          </lst>
          </lst>
          </lst>

          I don't think it should give me that suggestion. If a word is in the dictionary it should not give any suggestions. Am I right?

          Grant Ingersoll added a comment -

          I don't think it should give me that suggestion. If a word is in the dictionary it should not give any suggestions. Am I right?

          Possibly. I think it should give a better suggestion if one exists (i.e. more frequent) but otherwise, yes, it shouldn't give any suggestion. For your example, I would agree that it should not return a suggestion (assuming golf is in the dictionary). For example, the index could contain the words gilf and golf, with gilf having a freq. of 1 and golf having a freq of 100000. If the user enters gilf, I think it is reasonable to assume that the suggestion should be golf, even though gilf exists.

          Not saying this is supported yet, or anything, but just laying out the case.

          Oleg Gnatovskiy added a comment -

          I think that lower frequency suggestions should be optional. Some users might only want to offer suggestions for misspelled words (words not in the dictionary). Would it be hard to check if the query term exists in the dictionary before returning a suggestion?
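
          The check suggested above could look like the following sketch. The dictionary here is mocked as a plain Set rather than the spell-checker index, and the method names are hypothetical, not the component's actual API:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SkipKnownWords {
    /** Return a suggestion only for tokens absent from the dictionary. */
    static Map<String, String> suggest(List<String> tokens,
                                       Set<String> dictionary,
                                       Map<String, String> corrections) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String t : tokens) {
            // Skip words that are already spelled correctly.
            if (!dictionary.contains(t) && corrections.containsKey(t)) {
                out.put(t, corrections.get(t));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("pizza", "golf");
        Map<String, String> corr = Map.of("pizzza", "pizza", "golf", "roof");
        // "golf" is in the dictionary, so no suggestion is emitted for it
        System.out.println(suggest(List.of("pizzza", "golf"), dict, corr));
        // {pizzza=pizza}
    }
}
```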

          Grant Ingersoll added a comment -

          Would it be hard to check if the query term exists in the dictionary before returning a suggestion?

          I'd have to double check, but I think the Lucene SC already does this in some cases (onlyMorePopular????)

          Otis Gospodnetic added a comment -

          I think the frequency awareness may be interesting. What happens if "gilf" has a frequency of 95K and "golf" a freq of 100K? Do we need this to become a SCRH config setting expressed as a percentage? (e.g. "Show alternative word suggestions even if the input word exists in the index iff freq(input word)/freq(suggested word)*100 < N%?)
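
          The threshold rule proposed above reduces to a one-line frequency comparison; a minimal sketch, assuming N is a percentage configured by the user:

```java
public class FreqThreshold {
    /** Suggest an alternative even for an in-dictionary word iff
     *  freq(input word) / freq(suggested word) * 100 < thresholdPercent. */
    static boolean shouldSuggest(int inputFreq, int suggestionFreq,
                                 double thresholdPercent) {
        return (inputFreq * 100.0) / suggestionFreq < thresholdPercent;
    }

    public static void main(String[] args) {
        // gilf:1 vs golf:100000 -> 0.001% < 50%, so suggest golf
        System.out.println(shouldSuggest(1, 100_000, 50.0));      // true
        // gilf:95000 vs golf:100000 -> 95% >= 50%, so keep gilf as typed
        System.out.println(shouldSuggest(95_000, 100_000, 50.0)); // false
    }
}
```

          With such a knob, the gilf/golf case in the earlier comment suggests golf, while two near-equally-frequent words leave the input untouched.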

          Grant Ingersoll added a comment -

          Adds collation
          Slight change in SpellingResult to take advantage of a LinkedHashMap and to explicitly state in the contract that spelling suggestions are ordered best suggestion first.

          Also added some more javadocs. Getting much closer. I'd like to see LUCENE-1297 addressed and committed so it could be used in the Lucene SCs.

          I've used this API to implement my own spell checker, too, so I'm pretty happy w/ the API if others are. I'd like to commit in the next week or so, so if people can check it out, kick the tires, that would be great.
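
          The ordering guarantee mentioned above falls out of LinkedHashMap's insertion-order iteration, e.g. (illustrative values, not from the patch):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OrderedSuggestions {
    public static void main(String[] args) {
        // Insert suggestions best-first; iteration preserves that order,
        // unlike a plain HashMap, whose iteration order is unspecified.
        Map<String, Integer> suggestions = new LinkedHashMap<>();
        suggestions.put("pizza", 120); // best suggestion first
        suggestions.put("plaza", 40);
        suggestions.put("piazza", 5);
        System.out.println(String.join(",", suggestions.keySet()));
        // pizza,plaza,piazza
    }
}
```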

          Grant Ingersoll added a comment -

          Minor change to only return the collation if it is different from the original

          Grant Ingersoll added a comment -

          Make getSpellChecker protected, add in JMX Stuff. Handle if the SpellingResult is null

          Sean Timm added a comment -

          It doesn't appear that you can get both extendedResults and count > 1. With the below URL, I get 1 suggestion for each misspelled term regardless of the value of spellcheck.count. If I set spellcheck.extendedResults=false, then I get the requested three suggestions for each term.

          /solr/spellCheckCompRH/?q=waz+designatd+two+bee+Arvil+25+bye+Pres.+it+waz&version=2.2&start=0&rows=2&indent=on&spellcheck=true&fl=title,url,id,categories,score&hl=on&hl.fl=body&qt=dismax&spellcheck.extendedResults=true&spellcheck.count=3
          
          Erik Hatcher added a comment -

          the spell checker component handling build/reload seems highly awkward to me. suggestion component really should just do that... and wrap the other operations as a /spellchecker/rebuild kinda thing and not even necessarily componentize those operations since they don't really necessarily need to be hooked together with other operations as a single request.

          anyway, just the overloading of a "component" to do managerial operations seems awkward. food for thought. not a -1 kinda thing though.

          Grant Ingersoll added a comment -

          the spell checker component handling build/reload seems highly awkward to me. suggestion component really should just do that... and wrap the other operations as a /spellchecker/rebuild kinda thing and not even necessarily componentize those operations since they don't really necessarily need to be hooked together with other operations as a single request.

          I've thought about it a bit, too, as it bothers me as well, but I think the initialization, etc. gets a bit tricky, like all Solr initialization. Not sure what to do.

          Grant Ingersoll added a comment -

          Sean,

          I see the issue and am working on it. Good catch. I'll have a patch shortly.

          Grant Ingersoll added a comment -

          Fixes Sean's issue w/ extended results.

          Also, slightly modified the extended results format. See the

          testExtendedResultsCount()
          

          in SpellCheckComponentTest for the new format. Basically, it tries to normalize the map entries so that one can ask for specific things by name.

          Grant Ingersoll added a comment -

          OK, I'd like to commit this tomorrow or Wednesday. I am going to open another issue to bring LUCENE-1297 into the configuration.

          Yonik Seeley added a comment -

          For those who are just casually following this issue, is there a good summary of current input options and example output?

          Shalin Shekhar Mangar added a comment -

          A few questions/comments:

          1. Why is a WhitespaceTokenizer being used for tokenizing the value of the spellcheck.q parameter? Wouldn't it be more correct to use the query analyzer if the index is being built from a Solr field?
          2. The above argument also applies to queryAnalyzerFieldType which is being used for QueryConverter.
          3. I see that we can specify our own query converter through the queryConverter section in solrconfig.xml. But the SpellCheckComponent uses SpellingQueryConverter directly instead of an interface. We should add a QueryConverter interface if this needs to be pluggable.
          4. If name is omitted from two dictionaries in solrconfig.xml, then both get named Default by the SolrSpellChecker#init method and they overwrite each other in the spellCheckers map.
          5. How about building the index in the inform() method? I understand that users can build the index using spellcheck.build=true and they can also use QuerySenderListener to build the index, but this limits the user to FSDirectory: if we use RAMDirectory and Solr is restarted, the QuerySenderListener never fires and the spell checker is left with no index. It's not a major inconvenience to always use FSDirectory, but then RAMDirectory doesn't bring much to the table.
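
          For point 5, one common arrangement is to have a firstSearcher event rebuild the spell index whenever a new searcher opens, so even a RAM-backed index is rebuilt after restart. A sketch (the handler path /spellCheckCompRH and the warm-up query are placeholders for whatever the deployment uses):

          ```xml
          <!-- In solrconfig.xml: fire a warm-up request with spellcheck.build=true
               on first searcher open, so the spell index exists after restart.
               Handler path and query text below are illustrative. -->
          <listener event="firstSearcher" class="solr.QuerySenderListener">
            <arr name="queries">
              <lst>
                <str name="q">solr</str>
                <str name="qt">/spellCheckCompRH</str>
                <str name="spellcheck.build">true</str>
              </lst>
            </arr>
          </listener>
          ```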
          Grant Ingersoll added a comment -

          Why is a WhitespaceTokenizer being used for tokenizing the value of the spellcheck.q parameter? Wouldn't it be more correct to use the query analyzer if the index is being built from a Solr field?

          The above argument also applies to queryAnalyzerFieldType which is being used for QueryConverter

          My understanding was that the sc.q parameter was already analyzed and ready to be checked, thus all it needed was a conversion to tokens. As for the queryAnalyzerFieldType, that assumes the implementation is the IndexBasedSpellChecker or some other field based one that the SpellCheckComponent doesn't have access to, thus my reasoning that it needs to be handled separately and explicitly, which is why it isn't a part of the spellchecker configuration.

          I see that we can specify our own query converter through the queryConverter section in solrconfig.xml. But the SpellCheckComponent uses SpellingQueryConverter directly instead of an interface. We should add a QueryConvertor interface if this needs to be pluggable.

          I thought about making it an abstract base class, but in my mind it is really easy to override the SpellingQueryConverter and the component should know how to deal with it.
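
          To illustrate that point, here is a self-contained sketch of the kind of term extraction a query converter performs. This is not the actual SpellingQueryConverter (whose pattern and Token-based API live in Solr); the class name and regex below are illustrative assumptions only.

          ```java
          import java.util.ArrayList;
          import java.util.List;
          import java.util.regex.Matcher;
          import java.util.regex.Pattern;

          public class QueryTermExtractor {

              // Illustrative regex, NOT the pattern the real SpellingQueryConverter
              // uses: grab word terms, but skip bare numbers and field names that
              // are immediately followed by ':'.
              private static final Pattern TERM =
                      Pattern.compile("\\b(?!\\d+\\b)(\\w+)\\b(?!:)");

              /** Extracts checkable terms from a raw query string. */
              static List<String> convert(String query) {
                  List<String> terms = new ArrayList<>();
                  Matcher m = TERM.matcher(query);
                  while (m.find()) {
                      terms.add(m.group(1));
                  }
                  return terms;
              }

              public static void main(String[] args) {
                  System.out.println(convert("title:solr spel checker 123")); // prints: [solr, spel, checker]
              }
          }
          ```

          Overriding a converter, in Solr or here, is essentially just swapping the pattern or the convert method, which is why an overridable base class is attractive.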

          If name is omitted from two dictionaries in solrconfig.xml then both get named as Default from the SolrSpellChecker#init method and they overwrite each other in the spellCheckers map

          Hmm, not good. I will fix.

          How about building the index in the inform() method? I understand that the users can build the index using spellcheck.build=true and they can also use QuerySenderListener to build the index but this limits the user to use FSDirectory because if we use RAMDirectory and solr is restarted, the QuerySenderListener never fires and spell checker is left with no index. It's not a major inconvenience to use FSDirectory always but then RAMDirectory doesn't bring much to the table.

          I think this gets back to our early discussions about it not working in inform b/c we don't have the reader at that point, or something like that. I really don't know the right answer, but do feel free to try it out. I do think it belongs in inform, but not sure if Solr is ready at that point. As for the QuerySenderListener, seems like it should fire if it is restarted, but I admit I don't know a whole lot about that functionality.

          Grant Ingersoll added a comment -

          Fix for the default name issue, add a test for it.

          Grant Ingersoll added a comment -

          Thought some more about the comment about the QueryConverter, and decided to abstract it as Shalin suggests.

          Shalin Shekhar Mangar added a comment -

          Changes

          1. Moved Analyzer from AbstractLuceneSpellChecker to SolrSpellChecker since some form of query-time analysis would probably be needed for all spell checker implementations. Added a getQueryAnalyzer() method in SolrSpellChecker.
          2. Value specified for spellcheck.q is analyzed using the query analyzer for the dictionary as per the config (using SolrSpellChecker.getQueryAnalyzer). The value for "q" will continue to be analyzed by the QueryConverter.
          3. Removed the EncodedTextDictionary class. Now that we're using the lucene-2.4 spellchecker, it is no longer needed because the previously protected constructor of PlainTextDictionary is made public in 2.4
          4. Added org.apache.solr.spelling to package list which can be searched by SolrResourceLoader. Now we can write solr.IndexBasedSpellChecker instead of the fully qualified class name.
          5. "classname" attribute in configuration is optional now, it defaults to IndexBasedSpellChecker
          6. Minor additions to javadocs
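
          A configuration reflecting these changes might look like the following sketch (field name, directory path, and field type are illustrative; the SpellCheckComponent wiki page is the authoritative reference):

          ```xml
          <!-- "classname" is now optional and defaults to IndexBasedSpellChecker;
               "solr."-prefixed short names resolve via SolrResourceLoader. -->
          <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
            <str name="queryAnalyzerFieldType">textSpell</str>
            <lst name="spellchecker">
              <str name="name">default</str>
              <str name="field">word</str>
              <str name="spellcheckIndexDir">./spellchecker</str>
            </lst>
          </searchComponent>
          ```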
          Grant Ingersoll added a comment -

          Committed revision 669485. Note, I incorporated LUCENE-1297. See http://wiki.apache.org/solr/SpellCheckComponent for more details on how to use it, as well as the unit tests.

          Thanks to all who helped/contributed.

          Noble Paul added a comment -

          Why do we need to add the queryConverter definition outside of the spellcheck search component?
          Is it going to be used by any component other than this one?

          Grant Ingersoll added a comment - - edited

          Because of the stupid way it gets initialized as a NamedListInitializerWhateverWhatever. I'm open to alternate suggestions on how to do it and take advantage of the resource loader, etc.

          Every time I go to do initialization stuff in Solr these days I pine for Spring, since we are basically re-inventing it, albeit not as nicely.

          -Grant

          Geoffrey Young added a comment - - edited

          I'm seeing random weirdness in the collation results. the same query shift-refreshed sometimes yields (in json)

          {
           "responseHeader":{
              "params":{
          	"spellcheck":"true",
          	"q":"redbull air show",
          	"qf":"search-en",
          	"spellcheck.collate":"true",
          	"qt":"dismax",
          	"wt":"json",
          	"rows":"0"}},
           "response":{"numFound":0,"start":0,"docs":[]
           },
           "spellcheck":{
            "suggestions":[
          	"redbull",[
          	 "numFound",1,
          	 "startOffset",0,
          	 "endOffset",7,
          	 "suggestion",["redbelly"]],
          	"show",[
          	 "numFound",1,
          	 "startOffset",12,
          	 "endOffset",16,
          	 "suggestion",["shot"]],
          	"collation","redbelly airshotw"]}}
          

          note the "collation" spacing and extraneous 'w'. a refresh toggles between that and what you might expect:

          "collation","redbelly air shot"]
          

          UPDATE: opened new issue as SOLR-606

          --Geoff

          Grant Ingersoll added a comment -

          Can you open a new issue to track this? Looks like a string replace issue on the offsets. We probably should do the collation a bit differently to make sure the words fit right. We'll probably have to right pad or something like that.
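
          One way to sidestep offset drift entirely is to apply replacements from the highest start offset down to the lowest, so a replacement whose length differs from the original span never shifts the offsets of the corrections applied after it. A self-contained sketch (the Correction holder class is hypothetical, not a Solr API):

          ```java
          import java.util.ArrayList;
          import java.util.Arrays;
          import java.util.List;

          public class CollationSketch {

              /** Hypothetical holder for one correction span; not a Solr class. */
              static class Correction {
                  final int start, end;
                  final String replacement;
                  Correction(int start, int end, String replacement) {
                      this.start = start;
                      this.end = end;
                      this.replacement = replacement;
                  }
              }

              /**
               * Applies corrections right to left: each replace only touches
               * text at or after its own offsets, so earlier (lower-offset)
               * corrections remain valid regardless of length changes.
               */
              static String collate(String query, List<Correction> corrections) {
                  List<Correction> sorted = new ArrayList<>(corrections);
                  sorted.sort((a, b) -> Integer.compare(b.start, a.start));
                  StringBuilder sb = new StringBuilder(query);
                  for (Correction c : sorted) {
                      sb.replace(c.start, c.end, c.replacement);
                  }
                  return sb.toString();
              }

              public static void main(String[] args) {
                  List<Correction> fixes = Arrays.asList(
                          new Correction(0, 7, "redbelly"),  // "redbull" -> "redbelly"
                          new Correction(12, 16, "shot"));   // "show"    -> "shot"
                  System.out.println(collate("redbull air show", fixes)); // prints: redbelly air shot
              }
          }
          ```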

          Sean Timm added a comment -

          For what it is worth, here is the code that I used client side before the collation feature was available. I haven't looked at how it is done in this patch. It has some nice features, such as delimiting the spelling correction (e.g., with HTML bold tags) and preserving the user's initial case on each word.

                  StringBuilder buff = new StringBuilder();
                  StringBuilder rawBuff = new StringBuilder();
                  int last = 0;
                  String userStr = null;
                  // for each suggestion
                  for( Suggestion s : suggestions ) {
                    // add part before the misspelling
                      userStr = userQuery.substring( last, s.startOffset );
                      buff.append( userStr );
                      rawBuff.append( userStr );
                      String suggestion = s.suggestion;
                      if( _spellCheckPreserveUserCase ) {
                          userStr = userQuery.substring( s.startOffset, s.endOffset );
                          char[] userCh = userStr.toCharArray();
                          boolean initialUpper = Character.isUpperCase( userCh[0] );
                          boolean allUpper = true;
                          for( char c : userCh ) {
                              if( Character.isLowerCase( c ) ) {
                                  allUpper = false;
                                  break;
                              }
                          }
                          if( allUpper ) {
                              suggestion = suggestion.toUpperCase();
                          }
                          else if( initialUpper ) {
                              userCh = suggestion.toCharArray();
                              userCh[0] = Character.toUpperCase( userCh[0] );
                              suggestion = new String( userCh );
                          }
                      }
                      buff.append( _spellCheckStartHighlight ).append( suggestion )
                          .append( _spellCheckEndHighlight );
                      rawBuff.append( suggestion );
                      last = s.endOffset;
                  }
                // add part after all misspellings
                  userStr = userQuery.substring( last );
                  buff.append( userStr );
                  rawBuff.append( userStr );
                  if( log().isDebugEnabled() ) {
                      log().debug( "Did you mean: " + buff );
                      log().debug( "Did you mean link: " + rawBuff );
                  }
          
          Bojan Smid added a comment -

          I notice that the old pizza->plaza, golf->roof issue is still here.

          I created a patch against the latest trunk which deals with this; see the attachment. I believe the fix should be committed (maybe it should be implemented differently, but that's open for discussion; I used the spellchecker.exist() method).

          Grant Ingersoll added a comment -

          Hi Bojan,

          Thanks for the patch. I think it would be best to open a new issue for it.

          However, I'm not sure what is going on here. When I look at the Lucene code, it has this:

          final int freq = (ir != null && field != null) ? ir.docFreq(new Term(field, word)) : 0;
          final int goalFreq = (morePopular && ir != null && field != null) ? freq : 0;
          // if the word exists in the real index and we don't care for word frequency, return the word itself
          if (!morePopular && freq > 0) {
            return new String[] { word };
          }
          

          The comment says it all, so maybe we have something else going wrong.

          At a minimum, your patch at least needs to account for when you want to get more popular suggestions even if the word exists.


            People

            • Assignee:
              Grant Ingersoll
              Reporter:
              Shalin Shekhar Mangar
            • Votes:
              4 Vote for this issue
              Watchers:
              8 Start watching this issue
