Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: general/javadocs
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Added some javadocs that explain why the spellchecker does not work as one might expect it to.

      http://www.nabble.com/SpellChecker%3A%3AsuggestSimilar%28%29-Question-tf3118660.html#a8640395

      > Without having looked at the code for a long time, I think the problem is what the
      > Lucene scoring considers to be best. First the grams are searched, resulting in a number
      > of hits. Then the edit distance is calculated on each hit. "Genetics" is apparently the
      > third most similar hit according to Lucene, but the best according to Levenshtein.
      >
      > I.e. Lucene does not use edit-distance as similarity. You need to get a bunch of best hits
      > in order to find the one with the smallest edit-distance.

      I took a look at the code, and my assessment seems to be right.
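
The two-step behaviour described above (take the hits in the order the n-gram search returns them, then compute the edit distance for each) can be sketched in plain Java without any Lucene dependency. This is only an illustration under stated assumptions: the class name, the misspelled query "genetcs", and the hit order are made up, and the Levenshtein routine is the textbook dynamic-programming version rather than the code in contrib/spellcheck.

```java
import java.util.Arrays;
import java.util.List;

public class EditDistanceRerank {

    // Textbook Levenshtein distance: insert, delete and substitute each cost 1.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the hit with the smallest edit distance to the query.
    // On a tie the earlier (higher-ranked) hit is kept.
    static String closest(String query, List<String> rankedHits) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String hit : rankedHits) {
            int d = levenshtein(query, hit);
            if (d < bestDistance) {
                bestDistance = d;
                best = hit;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical hit order from the n-gram search (invented for illustration).
        List<String> rankedHits = Arrays.asList("generic", "kinetics", "genetics");
        // The third-ranked hit wins once edit distance is used as the criterion.
        System.out.println(closest("genetcs", rankedHits)); // prints "genetics"
    }
}
```

If only the first hit were fetched, "genetics" would never be considered, which is why a bunch of the best hits has to be gathered before the edit distance is compared.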

        Activity

        Karl Wettin added a comment -

        patch root is trunk/contrib/spellcheck

        Otis Gospodnetic added a comment -

        Applied, merci Karl.

        Karl Wettin added a comment -

        It might be noteworthy that the spell checker will gather numSug * 10 hits from the a priori corpus. I suppose that number (10) was something the original author came up with when testing. In most cases it seems to be good enough. In my refactor I've introduced a method parameter for the factor. This is probably a better-looking solution than telling the user to increase numSug, as a small numSug saves a few clock ticks by not adding a suggestion to the priority list.

        The javadocs should probably state something like that instead.
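
To make the role of the factor concrete, here is a rough sketch of what a method with an explicit factor parameter might look like. It is not the Lucene API and not the refactor mentioned above: the class, the method signature, the hit list and the distance table are all hypothetical. The distance is passed in as a function so that any string distance can be plugged in, and numSug and factor are assumed to be positive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToIntFunction;

public class CandidatePool {

    // Hypothetical method, not the Lucene API: keep the first numSug * factor
    // ranked hits, then return the numSug of them with the smallest distance.
    // Assumes numSug > 0 and factor > 0.
    static List<String> suggest(List<String> rankedHits, int numSug, int factor,
                                ToIntFunction<String> distance) {
        int poolSize = Math.min(rankedHits.size(), numSug * factor);
        List<String> pool = new ArrayList<>(rankedHits.subList(0, poolSize));
        // List.sort is stable, so hits with equal distance keep their rank order.
        pool.sort(Comparator.comparingInt(distance));
        return new ArrayList<>(pool.subList(0, Math.min(numSug, pool.size())));
    }

    public static void main(String[] args) {
        // Made-up ranked hits and a made-up distance table, for illustration only.
        List<String> rankedHits = Arrays.asList("generic", "kinetics", "genetics");
        ToIntFunction<String> distance = hit -> hit.equals("genetics") ? 1 : 3;

        // With factor 1 the pool holds only "generic", so the closest word is missed.
        System.out.println(suggest(rankedHits, 1, 1, distance));
        // With factor 10 the pool holds all three hits and "genetics" is found.
        System.out.println(suggest(rankedHits, 1, 10, distance));
    }
}
```

A larger factor widens the pool that is re-ranked by distance without changing how many suggestions are returned, whereas raising numSug widens the pool but also returns more suggestions.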


          People

          • Assignee:
            Otis Gospodnetic
            Reporter:
            Karl Wettin