  1. Solr
  2. SOLR-2010

Improvements to SpellCheckComponent Collate functionality


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.4.1
    • Fix Version/s: 3.1, 4.0-ALPHA
    • None
    • Environment: Tested against trunk revision 966633

    Description


      Our project requires a better Spell Check Collator. I'm contributing this as a patch to get suggestions for improvements and in case there is a broader need for these features.

      1. Only return collations that are guaranteed to produce hits if re-queried (with the original fq params also applied). This is especially helpful when a query has more than one correction. The 1.4 behavior does not verify that a particular combination will actually return hits.
      2. Provide the option to get multiple collation suggestions.
      3. Provide extended collation results, including the number of hits that re-querying will return and a breakdown of each misspelled word and its correction.
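
      As a rough illustration of the three items above, here is a minimal Python sketch of the idea, not the patch's actual Java code. The function name `verified_collations`, the `count_hits` callback, and the use of a plain Cartesian product to order candidates are all assumptions made for this example; the real collator may rank candidate combinations differently. The hit counts in the fake index are taken from the sample response in this description.

      ```python
      from itertools import islice, product

      def verified_collations(suggestions, count_hits, max_tries, max_collations=1):
          """Return up to max_collations corrections that yield at least one hit.

          suggestions: dict mapping each misspelled word to a ranked list of corrections.
          count_hits:  hypothetical callback that re-queries the index (with the
                       original fq params) and returns the number of hits.
          max_tries:   upper bound on how many candidate combinations are tested.
          """
          words = list(suggestions)
          results = []
          # Assumption: candidates are enumerated as a plain Cartesian product.
          candidates = product(*(suggestions[w] for w in words))
          for combo in islice(candidates, max_tries):
              corrections = dict(zip(words, combo))
              hits = count_hits(corrections)
              if hits > 0:  # keep only combinations that would return results
                  results.append({"corrections": corrections, "hits": hits})
                  if len(results) >= max_collations:
                      break
          return results

      # Illustrative stand-in for the index; not real Solr behavior.
      FAKE_INDEX = {("how", "fails"): 2, ("hope", "faith"): 2, ("chops", "all"): 1}

      def fake_count_hits(corrections):
          return FAKE_INDEX.get((corrections["hopq"], corrections["faill"]), 0)

      SUGGESTIONS = {
          "hopq": ["hope", "how", "chops"],
          "faill": ["fall", "fails", "faith", "all"],
      }

      found = verified_collations(SUGGESTIONS, fake_count_hits, max_tries=12, max_collations=3)
      for item in found:
          print(item["corrections"], item["hits"])
      ```

      A low `max_tries` keeps the number of re-queries small, at the cost of possibly finding no usable collation at all.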

      This patch is similar to what is described in SOLR-507 item #1. Also, this patch provides a viable workaround for the problem discussed in SOLR-1074. A dictionary could be created that combines the terms from the multiple fields. The collator then would prune out any spurious suggestions this would cause.

      This patch adds the following spellcheck parameters:

      1. spellcheck.maxCollationTries - maximum number of collation possibilities to try before giving up. Lower values ensure better performance. Higher values may be necessary to find a collation that can return results. Default is 0, which maintains backwards-compatible behavior (do not check collations).

      2. spellcheck.maxCollations - maximum number of collations to return. Default is 1, which maintains backwards-compatible behavior.

      3. spellcheck.collateExtendedResult - if true, returns an expanded response format detailing the collations found. Default is false, which maintains backwards-compatible behavior. When true, output is like this (in context):

      <lst name="spellcheck">
        <lst name="suggestions">
          <lst name="hopq">
            <int name="numFound">94</int>
            <int name="startOffset">7</int>
            <int name="endOffset">11</int>
            <arr name="suggestion">
              <str>hope</str>
              <str>how</str>
              <str>hope</str>
              <str>chops</str>
              <str>hoped</str>
              <!-- etc. -->
            </arr>
          </lst>
          <lst name="faill">
            <int name="numFound">100</int>
            <int name="startOffset">16</int>
            <int name="endOffset">21</int>
            <arr name="suggestion">
              <str>fall</str>
              <str>fails</str>
              <str>fail</str>
              <str>fill</str>
              <str>faith</str>
              <str>all</str>
              <!-- etc. -->
            </arr>
          </lst>
          <lst name="collation">
            <str name="collationQuery">Title:(how AND fails)</str>
            <int name="hits">2</int>
            <lst name="misspellingsAndCorrections">
              <str name="hopq">how</str>
              <str name="faill">fails</str>
            </lst>
          </lst>
          <lst name="collation">
            <str name="collationQuery">Title:(hope AND faith)</str>
            <int name="hits">2</int>
            <lst name="misspellingsAndCorrections">
              <str name="hopq">hope</str>
              <str name="faill">faith</str>
            </lst>
          </lst>
          <lst name="collation">
            <str name="collationQuery">Title:(chops AND all)</str>
            <int name="hits">1</int>
            <lst name="misspellingsAndCorrections">
              <str name="hopq">chops</str>
              <str name="faill">all</str>
            </lst>
          </lst>
        </lst>
      </lst>
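
      For clients that consume the XML directly, the extended collation entries shown above could be read roughly as follows. This is a minimal sketch using only the Python standard library; the helper name `parse_collations` and the trimmed `SAMPLE` document are made up for illustration, and the element names are assumed to match the response format shown in this description.

      ```python
      import xml.etree.ElementTree as ET

      # Trimmed copy of the extended response, keeping only two collation entries.
      SAMPLE = """
      <lst name="spellcheck">
        <lst name="suggestions">
          <lst name="collation">
            <str name="collationQuery">Title:(how AND fails)</str>
            <int name="hits">2</int>
            <lst name="misspellingsAndCorrections">
              <str name="hopq">how</str>
              <str name="faill">fails</str>
            </lst>
          </lst>
          <lst name="collation">
            <str name="collationQuery">Title:(chops AND all)</str>
            <int name="hits">1</int>
            <lst name="misspellingsAndCorrections">
              <str name="hopq">chops</str>
              <str name="faill">all</str>
            </lst>
          </lst>
        </lst>
      </lst>
      """

      def parse_collations(xml_text):
          """Extract query, hit count, and word corrections from each collation entry."""
          root = ET.fromstring(xml_text.strip())
          collations = []
          for node in root.iter("lst"):
              if node.get("name") != "collation":
                  continue
              # Map each misspelled word to the correction used in this collation.
              corrections = {
                  s.get("name"): s.text
                  for s in node.findall("lst[@name='misspellingsAndCorrections']/str")
              }
              collations.append({
                  "query": node.findtext("str[@name='collationQuery']"),
                  "hits": int(node.findtext("int[@name='hits']")),
                  "corrections": corrections,
              })
          return collations

      collations = parse_collations(SAMPLE)
      for c in collations:
          print(c["query"], c["hits"], c["corrections"])
      ```

      Note that several entries share the name "collation", so a client has to iterate over them rather than look up a single key.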

      In addition, SolrJ is updated to include SpellCheckResponse.getCollatedResults(), which returns the expanded Collation format. getCollatedResult(), which returns a single String, is retained for backwards compatibility. Other APIs were not changed, but they will still work provided that spellcheck.collateExtendedResult is false.

      This patch likely will not return valid results when using shards. Supporting shards would require a more robust interaction with the index than what exists in SpellCheckCollator.collate().

      Attachments

        1. SOLR-2010.patch
          39 kB
          James Dyer
        2. SOLR-2010.patch
          44 kB
          Grant Ingersoll
        3. SOLR-2010.txt
          68 kB
          James Dyer
        4. SOLR-2010.patch
          71 kB
          James Dyer
        5. SOLR-2010.patch
          70 kB
          James Dyer
        6. SOLR-2010_shardSearchHandler_993538.patch
          70 kB
          James Dyer
        7. SOLR-2010_shardRecombineCollations_993538.patch
          70 kB
          James Dyer
        8. SOLR-2010_shardSearchHandler_999521.patch
          70 kB
          James Dyer
        9. SOLR-2010_shardRecombineCollations_999521.patch
          70 kB
          James Dyer
        10. SOLR-2010_141.patch
          61 kB
          James Dyer
        11. solr_2010_3x.patch
          70 kB
          James Dyer
        12. SOLR-2010_141.patch
          55 kB
          James Dyer
        13. multiple_collations_as_an_array.patch
          7 kB
          James Dyer

            People

              Assignee: Grant Ingersoll (gsingers)
              Reporter: James Dyer (jdyer)