Details
Bug
Status: Open
Major
Resolution: Unresolved
8.2
None
None
The issue is a generic Java defect and is therefore independent of the operating system and software platform.
Description
Investigating the output from the "features()" stream source, terms are being returned with NaN for the score_f field:
"docs": [
  {
    "featureSet_s": "business",
    "score_f": "NaN",
    "term_s": "1,011.15",
    "idf_d": "-Infinity",
    "index_i": 1,
    "id": "business_1"
  },
  {
    "featureSet_s": "business",
    "score_f": "NaN",
    "term_s": "10.3m",
    "idf_d": "-Infinity",
    "index_i": 2,
    "id": "business_2"
  },
  {
    "featureSet_s": "business",
    "score_f": "NaN",
    "term_s": "01",
    "idf_d": "-Infinity",
    "index_i": 3,
    "id": "business_3"
  },
  ...
Looking into {{org/apache/solr/search/IGainTermsQParserPlugin.java}}, it seems that when a term does not occur in any of the positive or negative documents, the docFreq calculation (docFreq = xc + nc) yields 0, so the subsequent calculations divide by 0 and produce NaN.
Attached is a patch that skips terms for which docFreq is 0 in the finish() method of IGainTermsQParserPlugin. This resolves the NaN scores in the features() output.
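For illustration only, here is a minimal standalone sketch of the failure mode and of the kind of guard the patch applies. It is not the code from IGainTermsQParserPlugin: the class name, the positiveFraction() helper, and its formula are hypothetical stand-ins, assumed only to show that a double division with docFreq = 0 yields NaN rather than throwing, and that skipping such terms avoids it.

```java
import java.util.ArrayList;
import java.util.List;

public class DocFreqGuardSketch {

    // Hypothetical stand-in for a per-term statistic; NOT the plugin's actual formula.
    // xc = count in positive documents, nc = count in negative documents.
    static double positiveFraction(double xc, double nc) {
        double docFreq = xc + nc;
        // For doubles, 0.0 / 0.0 evaluates to NaN; no ArithmeticException is thrown.
        return xc / docFreq;
    }

    // Scores each {xc, nc} pair, skipping terms whose docFreq is 0
    // (the same idea as the guard described for finish()).
    static List<Double> scoreTerms(double[][] counts) {
        List<Double> scores = new ArrayList<>();
        for (double[] c : counts) {
            double docFreq = c[0] + c[1];
            if (docFreq == 0) {
                continue; // term appears in no positive or negative document
            }
            scores.add(positiveFraction(c[0], c[1]));
        }
        return scores;
    }

    public static void main(String[] args) {
        // Without the guard: prints NaN
        System.out.println(positiveFraction(0, 0));
        // With the guard: the {0, 0} term is dropped, prints [0.75, 0.0]
        System.out.println(scoreTerms(new double[][] {{3, 1}, {0, 0}, {0, 2}}));
    }
}
```

Because floating-point division by zero does not raise an exception in Java, the NaN propagates silently into the output, which is why an explicit check on docFreq is needed.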