Solr
  1. Solr
  2. SOLR-2571

IndexBasedSpellChecker "thresholdTokenFrequency" fails with a ClassCastException on startup

    Details

      Description

      When parsing the configuration for thresholdTokenFrequency", the IndexBasedSpellChecker tries to pull a Float from the DataConfig.xml-derrived NamedList. However, this comes through as a String. Therefore, a ClassCastException is always thrown whenever this parameter is specified. The code ought to be doing "Float.parseFloat(...)" on the value.

      This looks like a nice feature to use in cases the data contains misspelled or rare words leading to spurious "correct" queries. I would have liked to have used this with a project we just completed however this bug prevented that. This issue came up recently in the User's mailing list so I am raising an issue now.

      1. SOLR-2571.solr3.2.patch
        0.7 kB
        James Dyer
      2. SOLR-2571.patch
        0.7 kB
        James Dyer
      3. SOLR-2571.patch
        7 kB
        James Dyer
      4. SOLR-2571.patch
        7 kB
        James Dyer
      5. SOLR-2571.patch
        10 kB
        James Dyer

        Activity

        Hide
        James Dyer added a comment -

        Patches attached for Trunk & 3.x .

        This patch fixes the problem for IndexBasedSpellChecker. DirectSolrSpellChecker (Trunk only) appears to be correct already.

        Should this patch be committed, I will add documentation for "thresholdTokenFrequency" to the wiki. Currently it is absent from the wiki (although documented in Smiley&Pugh).

        Show
        James Dyer added a comment - Patches attached for Trunk & 3.x . This patch fixes the problem for IndexBasedSpellChecker. DirectSolrSpellChecker (Trunk only) appears to be correct already. Should this patch be committed, I will add documentation for "thresholdTokenFrequency" to the wiki. Currently it is absent from the wiki (although documented in Smiley&Pugh).
        Hide
        Robert Muir added a comment -

        Hi James, I'm confused about this one a little bit. Perhaps DirectSolrSpellChecker is actually wrong?

        If I configure the thing like this:

        <float name="thresholdTokenFrequency">0.07</float>
        

        Then it does apply the parameter. I guess what I'm asking is, if in general we should be using int/float/etc in these types and not <str> (especially DirectSolrSpellChecker which takes a lot of numeric parameters but expects them all to be <str>). Just glancing through solrconfig.xml its not clear that there is a precedent, it appears inconsistent as far as numeric parameters.

        Show
        Robert Muir added a comment - Hi James, I'm confused about this one a little bit. Perhaps DirectSolrSpellChecker is actually wrong? If I configure the thing like this: <float name="thresholdTokenFrequency">0.07</float> Then it does apply the parameter. I guess what I'm asking is, if in general we should be using int/float/etc in these types and not <str> (especially DirectSolrSpellChecker which takes a lot of numeric parameters but expects them all to be <str>). Just glancing through solrconfig.xml its not clear that there is a precedent, it appears inconsistent as far as numeric parameters.
        Hide
        James Dyer added a comment -

        ha...I was unaware solrconfig.xml could take <float /> elements at all. But now that you mention it, there's a couple of them in the example config file. I tried this for myself and it works.

        I guess with this being the case, I would agree that its better to change DirectSolrSpellChecker to require a Float, not a String. Once we decide "correctness" on this I'll also add this parameter to the wiki as its pretty much an undocumented feature...

        Show
        James Dyer added a comment - ha...I was unaware solrconfig.xml could take <float /> elements at all. But now that you mention it, there's a couple of them in the example config file. I tried this for myself and it works. I guess with this being the case, I would agree that its better to change DirectSolrSpellChecker to require a Float, not a String. Once we decide "correctness" on this I'll also add this parameter to the wiki as its pretty much an undocumented feature...
        Hide
        Mike Sokolov added a comment -

        sounds like a good case for a config schema

        Show
        Mike Sokolov added a comment - sounds like a good case for a config schema
        Hide
        Robert Muir added a comment -

        Mike, I think I agree: currently we are relying upon examples in the wiki, but in this case one did not exist and it was/is totally confusing.

        Show
        Robert Muir added a comment - Mike, I think I agree: currently we are relying upon examples in the wiki, but in this case one did not exist and it was/is totally confusing.
        Hide
        Mike Sokolov added a comment -

        I posted a patch in SOLR-1758 that has a preliminary schema and implements schema-checking when loading config files that could help

        Show
        Mike Sokolov added a comment - I posted a patch in SOLR-1758 that has a preliminary schema and implements schema-checking when loading config files that could help
        Hide
        James Dyer added a comment -

        I'm betting the jury will rule we keep this a <float /> element, so here's a patch that changes DirectSolrSpellChecker. I also added a unit test for thresholdTokenFrequency and added a (commented-out) line for it in the example solrconfig.xml.

        There are 3 TODO's in the unit test code:
        1. My ignorance of the expression language used in unit-tests lead mem write an old-style long-form unit test. If someone can show me how to convert this to a 1-liner I would be very appreciative.

        2. I found that DirectSolrSpellChecker returns results in a slightly different format than IndexBasedSpellChecker. Is this OK? Can SOLRJ handle this or do we need to tweak there?

        3. Also, in one case IndexBasedSpellChecker returns "correctlySpelled=false" while DirectSolrSpellChecker returns "correctlySpelled=true". Is this discrepancy valid?

        Show
        James Dyer added a comment - I'm betting the jury will rule we keep this a <float /> element, so here's a patch that changes DirectSolrSpellChecker. I also added a unit test for thresholdTokenFrequency and added a (commented-out) line for it in the example solrconfig.xml. There are 3 TODO's in the unit test code: 1. My ignorance of the expression language used in unit-tests lead mem write an old-style long-form unit test. If someone can show me how to convert this to a 1-liner I would be very appreciative. 2. I found that DirectSolrSpellChecker returns results in a slightly different format than IndexBasedSpellChecker. Is this OK? Can SOLRJ handle this or do we need to tweak there? 3. Also, in one case IndexBasedSpellChecker returns "correctlySpelled=false" while DirectSolrSpellChecker returns "correctlySpelled=true". Is this discrepancy valid?
        Hide
        Robert Muir added a comment -

        Thanks for updating the patch!

        I found that DirectSolrSpellChecker returns results in a slightly different format than IndexBasedSpellChecker. Is this OK? Can SOLRJ handle this or do we need to tweak there?

        Not sure, I have used DirectSolrSpellChecker with solrj and I didn't have any problems... but that's not saying there isn't one.

        Also, in one case IndexBasedSpellChecker returns "correctlySpelled=false" while DirectSolrSpellChecker returns "correctlySpelled=true". Is this discrepancy valid?

        I don't know, what makes this 'decision' of correctlySpelled? Do you know? Remember also the DirectSolrSpellChecker is a different spellchecker totally than IndexBasedSpellChecker (it uses a fundamentally different algorithm), although I tried to keep some of the parameters consistent.

        Another question is, there are lots of other float/int arguments to DirectSolrSpellChecker, maybe we should cut all of these over to <int> and <float> while we are here?

        Show
        Robert Muir added a comment - Thanks for updating the patch! I found that DirectSolrSpellChecker returns results in a slightly different format than IndexBasedSpellChecker. Is this OK? Can SOLRJ handle this or do we need to tweak there? Not sure, I have used DirectSolrSpellChecker with solrj and I didn't have any problems... but that's not saying there isn't one. Also, in one case IndexBasedSpellChecker returns "correctlySpelled=false" while DirectSolrSpellChecker returns "correctlySpelled=true". Is this discrepancy valid? I don't know, what makes this 'decision' of correctlySpelled? Do you know? Remember also the DirectSolrSpellChecker is a different spellchecker totally than IndexBasedSpellChecker (it uses a fundamentally different algorithm), although I tried to keep some of the parameters consistent. Another question is, there are lots of other float/int arguments to DirectSolrSpellChecker, maybe we should cut all of these over to <int> and <float> while we are here?
        Hide
        James Dyer added a comment -

        This version takes all of DirectSolrSpellChecker's parameters as Integer and Float objects rather than Strings, as appropriate. Also, I changed the "accuracy" parameter to use SpellingParams.SPELLCHECK_ACCURACY ... I'm not sure if this would have validated any unit tests (I didn't see any tests that use DirectSolrSpellChecker).

        I think this will make DirectSolrSpellChecker more consistent with the rest of "solrconfig.xml"s parameter requirements. The only better option than this, maybe, would to make it flexible and allow either the Int/Float or String in these cases. I think this later option is not necessary however.

        Show
        James Dyer added a comment - This version takes all of DirectSolrSpellChecker's parameters as Integer and Float objects rather than Strings, as appropriate. Also, I changed the "accuracy" parameter to use SpellingParams.SPELLCHECK_ACCURACY ... I'm not sure if this would have validated any unit tests (I didn't see any tests that use DirectSolrSpellChecker). I think this will make DirectSolrSpellChecker more consistent with the rest of "solrconfig.xml"s parameter requirements. The only better option than this, maybe, would to make it flexible and allow either the Int/Float or String in these cases. I think this later option is not necessary however.
        Hide
        James Dyer added a comment -

        what makes this 'decision' of correctlySpelled? Do you know?

        I took a quick look to find out. Its more complicated than I thought! Here's the basic jist (I think!) :

        • If the instance of SolrSpellChecker returns frequency data and all suggestions have frequency >0, TRUE.
        • If the instance of SolrSpellChecker returns frequency data and any suggestion have frequency == 0, FALSE.
        • If the instance of SolrSpellChecker returns NO frequency data but has suggestions, OMIT.
        • If the instance of SolrSpellChecker returns NO suggestions, FALSE.

        Possibly this isn't fully accurate but I'm at least mostly correct here. Seems like the discrepency with DirectSolrSpellChecker is because it isn't returning Frequency info?

        This all happens in SpellCheckComponent.toNamedList() ... I'm guessing the code here uses the presence or absence of frequency data as kind of a proxy indicator whether or not its dealing with IndexBasedSpellChecker or FileBasedSpellChecker. Possibly it would be better if each instance of SolrSpellChecker had a "isCorrectlySpelled()" method that toNamedList() could call? Maybe I should I go open another jira issue for that?

        Show
        James Dyer added a comment - what makes this 'decision' of correctlySpelled? Do you know? I took a quick look to find out. Its more complicated than I thought! Here's the basic jist (I think!) : If the instance of SolrSpellChecker returns frequency data and all suggestions have frequency >0, TRUE. If the instance of SolrSpellChecker returns frequency data and any suggestion have frequency == 0, FALSE. If the instance of SolrSpellChecker returns NO frequency data but has suggestions, OMIT. If the instance of SolrSpellChecker returns NO suggestions, FALSE. Possibly this isn't fully accurate but I'm at least mostly correct here. Seems like the discrepency with DirectSolrSpellChecker is because it isn't returning Frequency info? This all happens in SpellCheckComponent.toNamedList() ... I'm guessing the code here uses the presence or absence of frequency data as kind of a proxy indicator whether or not its dealing with IndexBasedSpellChecker or FileBasedSpellChecker. Possibly it would be better if each instance of SolrSpellChecker had a "isCorrectlySpelled()" method that toNamedList() could call? Maybe I should I go open another jira issue for that?
        Hide
        Robert Muir added a comment -

        This version takes all of DirectSolrSpellChecker's parameters as Integer and Float objects rather than Strings, as appropriate.

        Did you maybe upload an older patch? I took a look and it only seems to cutover the threshold param.

        I'm not sure if this would have validated any unit tests (I didn't see any tests that use DirectSolrSpellChecker).

        There is a test (DirectSolrSpellCheckerTest), but its probably not that great

        Show
        Robert Muir added a comment - This version takes all of DirectSolrSpellChecker's parameters as Integer and Float objects rather than Strings, as appropriate. Did you maybe upload an older patch? I took a look and it only seems to cutover the threshold param. I'm not sure if this would have validated any unit tests (I didn't see any tests that use DirectSolrSpellChecker). There is a test (DirectSolrSpellCheckerTest), but its probably not that great
        Hide
        Robert Muir added a comment -

        Possibly this isn't fully accurate but I'm at least mostly correct here. Seems like the discrepency with DirectSolrSpellChecker is because it isn't returning Frequency info?

        This sounds like a bug, care to open a separate issue on it? (we can resolve the int/float stuff here on this one).

        The thing certainly intends to return freq info...

        SuggestWord[] suggestions = checker.suggestSimilar(new Term(field, token.toString()), 
                  options.count, options.reader, options.onlyMorePopular, accuracy);
              for (SuggestWord suggestion : suggestions)
                result.add(token, suggestion.string, suggestion.freq);
        
        Show
        Robert Muir added a comment - Possibly this isn't fully accurate but I'm at least mostly correct here. Seems like the discrepency with DirectSolrSpellChecker is because it isn't returning Frequency info? This sounds like a bug, care to open a separate issue on it? (we can resolve the int/float stuff here on this one). The thing certainly intends to return freq info... SuggestWord[] suggestions = checker.suggestSimilar(new Term(field, token.toString()), options.count, options.reader, options.onlyMorePopular, accuracy); for (SuggestWord suggestion : suggestions) result.add(token, suggestion.string, suggestion.freq);
        Hide
        James Dyer added a comment -

        Here is that patch with Ints/Floats instead of Strings. I made a tiny adjustment to the unit test also.

        Show
        James Dyer added a comment - Here is that patch with Ints/Floats instead of Strings. I made a tiny adjustment to the unit test also.
        Hide
        Robert Muir added a comment -

        Committed revision 1132855 (trunk).
        I organized the constants in DirectSolrSpellchecker a bit, so its easy to see which ones are 'shared' with the others and which ones are unique to it.

        Committed revision 1132856 (branch_3x).
        I backported the test and example here. In the case of this test, it needed to clearIndex() in setup() like trunk does, so I merged these bits also.

        Thanks James!

        Show
        Robert Muir added a comment - Committed revision 1132855 (trunk). I organized the constants in DirectSolrSpellchecker a bit, so its easy to see which ones are 'shared' with the others and which ones are unique to it. Committed revision 1132856 (branch_3x). I backported the test and example here. In the case of this test, it needed to clearIndex() in setup() like trunk does, so I merged these bits also. Thanks James!
        Hide
        James Dyer added a comment -

        I added "thresholdTokenFrequency" to the SpellCheckComponent wiki page.

        Show
        James Dyer added a comment - I added "thresholdTokenFrequency" to the SpellCheckComponent wiki page.
        Hide
        Robert Muir added a comment -

        Bulk close for 3.3

        Show
        Robert Muir added a comment - Bulk close for 3.3

          People

          • Assignee:
            Robert Muir
            Reporter:
            James Dyer
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development