  1. Solr
  2. SOLR-2761

FSTLookup should use long-tail-like discretization instead of proportional (linear) discretization

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 3.4
    • Fix Version/s: 3.5, 3.6, 4.0-ALPHA
    • Component/s: spellchecker
    • Labels:
      None

      Description

      The Suggester's FSTLookup implementation discretizes term frequencies into a configurable number of buckets (the "weightBuckets" setting) in order to deal with FST limitations. A source frequency is mapped to a bucket proportionally (i.e. linearly) between the minimum and maximum values. I don't think this makes sense at all given the well-known long-tail-like distribution of term frequencies: most terms have low frequencies, so a linear mapping puts nearly all of them into the lowest buckets. As a result, I've found it necessary to increase weightBuckets substantially, to more than 100, to get quality suggestions.
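
      To illustrate the concern, here is a minimal sketch comparing a proportional mapping with a logarithmic one. It is not FSTLookup's actual code: the function names, the sample frequencies, and the choice of a logarithmic scale are all assumptions made for illustration, and a logarithm is only one possible long-tail-aware mapping.

```python
import math


def linear_bucket(freq, min_freq, max_freq, buckets):
    """Proportional mapping: equal-width slices of [min_freq, max_freq]."""
    if max_freq == min_freq:
        return 0
    scaled = (freq - min_freq) / (max_freq - min_freq)
    # Clamp so that freq == max_freq lands in the last bucket.
    return min(buckets - 1, int(scaled * buckets))


def log_bucket(freq, min_freq, max_freq, buckets):
    """Logarithmic mapping: equal-width slices of [log(min_freq), log(max_freq)].

    Assumes all frequencies are >= 1 so that the logarithm is defined.
    """
    if max_freq == min_freq:
        return 0
    span = math.log(max_freq) - math.log(min_freq)
    scaled = (math.log(freq) - math.log(min_freq)) / span
    return min(buckets - 1, int(scaled * buckets))


# Hypothetical long-tail-like term frequencies (not taken from a real index).
freqs = [1, 2, 5, 10, 50, 200, 1000, 10000]
buckets = 10

linear = [linear_bucket(f, min(freqs), max(freqs), buckets) for f in freqs]
logged = [log_bucket(f, min(freqs), max(freqs), buckets) for f in freqs]

print("linear:", linear)
print("log:   ", logged)
```

      With these sample values and 10 buckets, the linear mapping places every term except the most frequent one into bucket 0, while the logarithmic mapping spreads the same terms over most of the available buckets. That is the effect that otherwise has to be compensated for by raising weightBuckets.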

              People

              • Assignee:
                dweiss Dawid Weiss
                Reporter:
                dsmiley David Smiley
              • Votes:
                0
                Watchers:
                0
