Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: spellchecker
    • Labels:
      None

      Description

      Creating a placeholder issue to track Spell Checking Improvements. Individual issues can later be created and linked for each area of separable concern when they are determined.

      Areas to discuss include:

      1. spell suggestions from within the current query (minus the terms being corrected), filtered so that suggestions are always valid
        • need approaches to merging the spelling list with the current mask of valid records. Also, is this better made as a change to Lucene first, or does it belong in Solr?
        • need to add spell checking as a query component and make it available to the various query handlers
        • spell checking needs to be field specific to support responding correctly to dismax queries
      2. spell suggestions from a distributed search (SOLR-303) by augmenting the response, or alternatively just provide federation of Spell Checker requests on their own and let the application decide when to use each
      3. spell suggestions as a search component to augment other queries
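
One way to picture the filtering in item 1 is to intersect each candidate term's document set with the mask of documents matched by the rest of the query. The sketch below is only an illustration under assumed data structures: `MaskedSuggestionFilter`, `postings`, and `validDocs` are hypothetical names, and a plain `java.util.BitSet` stands in for whatever Lucene or Solr would actually use.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Lucene or Solr.
class MaskedSuggestionFilter {

    /**
     * Keeps only the candidate terms that occur in at least one document
     * of the current result set.
     *
     * @param candidates raw suggestions from the spell checker, best first
     * @param postings   assumed lookup: term -> ids of documents containing it
     * @param validDocs  mask of documents matched by the rest of the query
     */
    static List<String> filter(List<String> candidates,
                               Map<String, BitSet> postings,
                               BitSet validDocs) {
        List<String> kept = new ArrayList<>();
        for (String term : candidates) {
            BitSet docs = postings.get(term);
            // A term unknown to the index, or one whose documents all fall
            // outside the mask, would produce zero hits, so it is dropped.
            if (docs != null && docs.intersects(validDocs)) {
                kept.add(term);
            }
        }
        return kept;
    }

    /** Builds a BitSet with the given document ids set. */
    static BitSet bits(int... ids) {
        BitSet b = new BitSet();
        for (int id : ids) {
            b.set(id);
        }
        return b;
    }

    public static void main(String[] args) {
        Map<String, BitSet> postings = Map.of(
            "sedan", bits(0, 1),
            "sedate", bits(5));
        BitSet validDocs = bits(0, 1, 2); // e.g. documents matching "hybrid"
        // prints [sedan]
        System.out.println(
            filter(List.of("sedate", "sedan", "sedum"), postings, validDocs));
    }
}
```

The order of the surviving candidates is preserved, so the spell checker's own ranking still decides which valid suggestion comes first.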

      What are other typical areas of concern, or suggestions for improvements for spell checking that can be tracked?

      I am willing to look at driving a patch for this area, especially for spell checking working within the current result set, and across distributed search.

        Issue Links

          Activity

          Jayson Minard created issue -
          Jayson Minard added a comment - - edited

          A related item from the Lucene project:

          • LUCENE-626 "Extended spell checker with phrase support and adaptive user session analysis" provides phrase-level spell suggestions.

          And tracking comments about spell suggestion algorithms just in case this comes up:

          • Spelling Checker using Lucene

          Jayson Minard added a comment -

          Updated description to provide alternatives for distributed search.

          Yonik Seeley added a comment -

          Spell checking is not an area I've personally looked at, but your list of discussion items looks spot on.
          IMO, since integrating spelling suggestions with general query results (search, facet, highlight) hasn't been done before in Solr, the response format is wide open (go crazy!)

          Jayson Minard made changes -
          Link This issue relates to SOLR-303 [ SOLR-303 ]
          Jayson Minard added a comment -

          Linking to related issue of distributed search.

          Shalin Shekhar Mangar added a comment -

          I have just finished implementing a SpellCheck library (using Lucene) for a project which was not already using Solr. I implemented a few ideas there which can be added to Solr.

          • Given a user query consisting of many words, return just one suggestion for the whole query, e.g. a search for "hybrd sedn" gives you "hybrid sedan" as a suggestion
          • Give a suggestion on a per-field basis
          • Never give duplicate words in a suggestion, e.g. if my index contains "Mercedes-Benz" and the user searches for "mercedec bens", they should not get a suggestion like "Mercedes-Benz Mercedes-Benz"
          • Don't try to give a suggestion for tokens shorter than a given length (my impl used 3). For a query like "mercedes e class" it avoids giving a suggestion like "mercedes e-class c-class"

          I understand that these tweaks are often very specific to the use case, but we can at least provide the features for people to use as they see fit. In order to implement the multiple-field support, we can change SpellCheckerRequestHandler to create a HighFrequencyDictionary for each configured field and add them all to the spell check index. We can use the overloaded suggestSimilar method (which accepts a field) to query. If this sounds fine, I can give a patch to add these features.
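
The query-level rules listed above (one suggestion for the whole query, no duplicate words, short tokens left alone) could be sketched roughly as follows. This is not the implementation described in the comment: `QuerySuggester` and the `best` lookup map are hypothetical stand-ins for a real spell checker call, and only the minimum length of 3 is taken from the text.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; the per-token lookup is assumed to come from elsewhere.
class QuerySuggester {

    // Tokens shorter than this are passed through uncorrected (the comment used 3).
    static final int MIN_TOKEN_LENGTH = 3;

    /**
     * Builds a single suggestion for a whole multi-word query.
     *
     * @param query whitespace-separated user query
     * @param best  assumed lookup: misspelled token -> best correction
     */
    static String suggest(String query, Map<String, String> best) {
        // LinkedHashSet keeps first-seen order and silently drops repeats,
        // so two tokens correcting to the same word appear only once.
        Set<String> words = new LinkedHashSet<>();
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty query
            }
            String word = token.length() < MIN_TOKEN_LENGTH
                ? token                            // too short: keep as typed
                : best.getOrDefault(token, token); // no correction: keep as typed
            words.add(word);
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        Map<String, String> best = Map.of(
            "hybrd", "hybrid",
            "sedn", "sedan",
            "mercedec", "Mercedes-Benz",
            "bens", "Mercedes-Benz");
        System.out.println(suggest("hybrd sedn", best));
        System.out.println(suggest("mercedec bens", best));
    }
}
```

One side effect of deduplicating with a set is that a word the user deliberately typed twice is also collapsed; whether that is acceptable depends on the use case.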

          Otis Gospodnetic added a comment -

          Shalin:
          This all sounds very good. Do you mind opening a new JIRA issue with this information, so you can attach a patch to that? Thanks.

          Shalin Shekhar Mangar added a comment -

          A new JIRA issue, SOLR-572, has been created for a Search Component for the Lucene contrib SpellChecker.

          Another suggestion:

          • Have a postCommit/postOptimize listener to (re)create spell checker index. Currently, the user needs to make an explicit query to build the spell index which can be easily automated.
          Otis Gospodnetic added a comment -

          Shalin,
          Good idea, I think, but only if the spell checker (SC) index is being built from the modified index. One thing I'd like to add to SpellCheckerRequestHandler (SCRH), actually, is the ability to (re)build the SC index from a plain text file (via the PlainTextDictionary class in Lucene's SC). In that case postCommit/postOptimize should not trigger SC index rebuilding.

          Hoss Man added a comment -

          "Have a postCommit/postOptimize listener to (re)create spell checker index. Currently, the user needs to make an explicit query to build the spell index which can be easily automated."

          It's not true that users need to make an explicit query: the QuerySenderListener can be used to automatically trigger the "request" to rebuild the index. The rebuild cmd was designed that way intentionally, so people could manually hit it if desired, or it could be automated as part of a postCommit or postOptimize hook.

          Shalin Shekhar Mangar added a comment -

          Hoss - I was not aware that I could use the QuerySenderListener in that way.

          Otis - I agree.

          Shalin Shekhar Mangar made changes -
          Link This issue incorporates SOLR-572 [ SOLR-572 ]
          Grant Ingersoll made changes -
          Link This issue relates to SOLR-785 [ SOLR-785 ]
          Grant Ingersoll added a comment -

          Most of these items are (or will be) fixed by other issues.

          Grant Ingersoll made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Won't Fix [ 2 ]
          Grant Ingersoll made changes -
          Link This issue relates to SOLR-2010 [ SOLR-2010 ]
          Grant Ingersoll made changes -
          Link This issue relates to LUCENE-2479 [ LUCENE-2479 ]
          Grant Ingersoll made changes -
          Link This issue is related to LUCENE-2608 [ LUCENE-2608 ]

            People

            • Assignee:
              Unassigned
              Reporter:
              Jayson Minard
            • Votes:
              2
              Watchers:
              6

              Dates

              • Created:
                Updated:
                Resolved:
