SOLR-2761

FSTLookup should use long-tail like discretization instead of proportional (linear)

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 3.4
    • Fix Version/s: 3.5, 3.6, 4.0-ALPHA
    • Component/s: spellchecker
    • Labels:
      None

      Description

      The Suggester's FSTLookup implementation discretizes term frequencies into a configurable number of buckets (the "weightBuckets" setting) to work around FST limitations. A source frequency is mapped to a bucket proportionally (i.e. linearly) between the minimum and maximum values. I don't think this makes sense given the well-known long-tail-like distribution of term frequencies: most terms are rare, so a linear mapping places nearly all of them in the lowest buckets. As a result, I've found it necessary to increase weightBuckets substantially, to more than 100, to get quality suggestions.
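
      To illustrate the concern, here is a minimal sketch comparing a proportional mapping with a logarithmic one. This is not FSTLookup's actual code; the function names (linear_bucket, log_bucket) and the sample frequencies are hypothetical, and log scaling is only one possible long-tail-friendly mapping, not necessarily the one adopted for this issue.

      ```python
      import math

      def linear_bucket(freq, min_freq, max_freq, buckets):
          """Proportional mapping of freq in [min_freq, max_freq] to 0..buckets-1."""
          if max_freq == min_freq:
              return 0
          scaled = (freq - min_freq) / (max_freq - min_freq)
          # Clamp so that freq == max_freq lands in the last bucket.
          return min(buckets - 1, int(scaled * buckets))

      def log_bucket(freq, min_freq, max_freq, buckets):
          """Same mapping applied to log10(freq); assumes all frequencies are >= 1."""
          if max_freq == min_freq:
              return 0
          lo, hi = math.log10(min_freq), math.log10(max_freq)
          scaled = (math.log10(freq) - lo) / (hi - lo)
          return min(buckets - 1, int(scaled * buckets))

      # Hypothetical long-tail sample: many rare terms, one very frequent term.
      freqs = [1, 2, 5, 10, 50, 100, 1000, 100000]
      buckets = 10
      lo, hi = min(freqs), max(freqs)

      linear = [linear_bucket(f, lo, hi, buckets) for f in freqs]
      logarithmic = [log_bucket(f, lo, hi, buckets) for f in freqs]

      print("linear:     ", linear)
      print("logarithmic:", logarithmic)
      ```

      With these sample values the linear mapping puts every term except the most frequent one into bucket 0, so the suggester cannot rank them against each other, while the logarithmic mapping spreads the same terms over several buckets.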

            People

            • Assignee:
              Dawid Weiss
              Reporter:
              David Smiley