Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-2761

FSTLookup should use long-tail like discretization instead of proportional (linear)

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Closed
    • Minor
    • Resolution: Duplicate
    • 3.4
    • 3.5, 3.6, 4.0-ALPHA
    • spellchecker
    • None

    Description

      The Suggester's FSTLookup implementation discretizes the term frequencies into a configurable number of buckets (configurable as "weightBuckets") in order to deal with FST limitations. The mapping of a source frequency into a bucket is a proportional (i.e. linear) mapping from the minimum and maximum value. I don't think this makes sense at all given the well-known long-tail like distribution of term frequencies. As a result of this problem, I've found it necessary to increase weightBuckets substantially, like >100, to get quality suggestions.

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            dweiss Dawid Weiss
            dsmiley David Smiley
            Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment