Details
Type: Improvement
Status: Closed
Priority: Minor
Resolution: Fixed
Fix Version/s: 4.0-ALPHA, 6.0
Lucene Fields: New, Patch Available
Description
The FastVectorHighlighter currently disregards IDF weights for matching terms within generated fragments. In the worst case, a fragment that contains a high number of very common words is scored higher than a fragment that contains all of the terms used in the original query.
This patch provides ordered fragments with IDF-weighted terms:
For each distinct matching term per fragment:
    weight = weight + IDF * boost
For each fragment:
    weight = weight * length * 1 / sqrt( length )

weight | total weight of the fragment
IDF    | inverse document frequency of each distinct matching term
boost  | query boost as provided, for example term^2
length | total number of matching terms in the fragment, counting repeated terms
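The two steps above can be tried out in isolation. The following is a minimal sketch, not code from the patch: the class FragmentWeightSketch, the nested Match type, and the IDF values are all hypothetical and chosen only for illustration. It assumes that each match carries its term text, an IDF value, and a query boost.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the weighting described above; not a Lucene class.
final class FragmentWeightSketch {

    // Hypothetical stand-in for one matching term inside a fragment.
    public static final class Match {
        final String term;
        final double idf;
        final double boost;

        public Match(String term, double idf, double boost) {
            this.term = term;
            this.idf = idf;
            this.boost = boost;
        }
    }

    // Sums IDF * boost once per distinct term, then scales by length * (1 / sqrt(length)).
    public static double weight(List<Match> matches) {
        Set<String> distinctTerms = new HashSet<>();
        double weight = 0.0;
        int length = 0;
        for (Match m : matches) {
            if (distinctTerms.add(m.term)) {
                weight += m.idf * m.boost;
            }
            length++; // repeated terms still count toward the length
        }
        if (length == 0) {
            return 0.0; // assumption of this sketch: an empty fragment weighs nothing
        }
        return weight * length * (1.0 / Math.sqrt(length));
    }

    public static void main(String[] args) {
        // Four occurrences of one very common term (assumed IDF 0.5).
        List<Match> common = Arrays.asList(
                new Match("the", 0.5, 1.0), new Match("the", 0.5, 1.0),
                new Match("the", 0.5, 1.0), new Match("the", 0.5, 1.0));
        // One occurrence each of two rarer terms (assumed IDF 3.0 and 4.0).
        List<Match> rare = Arrays.asList(
                new Match("lucene", 3.0, 1.0), new Match("highlighter", 4.0, 1.0));
        System.out.println("common fragment: " + weight(common));
        System.out.println("rare fragment:   " + weight(rare));
    }
}
```

With these assumed values, the fragment with four common matches gets a weight of 1.0, while the fragment with two rare matches gets about 9.9, so the fragment covering more of the query ranks first even though it has fewer matches.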
Method:
public void add( int startOffset, int endOffset, List<WeightedPhraseInfo> phraseInfoList ) {
  float totalBoost = 0;
  List<SubInfo> subInfos = new ArrayList<SubInfo>();
  HashSet<String> distinctTerms = new HashSet<String>();
  int length = 0;

  for( WeightedPhraseInfo phraseInfo : phraseInfoList ){
    subInfos.add( new SubInfo( phraseInfo.getText(), phraseInfo.getTermsOffsets(), phraseInfo.getSeqnum() ) );
    for ( TermInfo ti : phraseInfo.getTermsInfos()) {
      // Each distinct term contributes its weight times the phrase boost only once.
      if ( distinctTerms.add( ti.getText() ) )
        totalBoost += ti.getWeight() * phraseInfo.getBoost();
      // Every matching term, repeated or not, counts toward the length.
      length++;
    }
  }

  // Length normalization: length * ( 1 / sqrt( length ) ) equals sqrt( length ).
  totalBoost *= length * ( 1 / Math.sqrt( length ) );

  getFragInfos().add( new WeightedFragInfo( startOffset, endOffset, subInfos, totalBoost ) );
}
The ranking formula should be the same as, or at least similar to, the one used in QueryTermScorer.
This patch contains:
- a renamed class member in FieldPhraseList (termInfos to termsInfos)
- a renamed local variable in SimpleFieldFragList (score to totalBoost)
- a missing @Override annotation added in SimpleFragListBuilder
- class WeightedFieldFragList, an implementation of FieldFragList
- class WeightedFragListBuilder, an implementation of BaseFragListBuilder
- class WeightedFragListBuilderTest, a simple test case
- updated docs for FVH
This is the last part of LUCENE-3440 (see also LUCENE-4091, LUCENE-4107, LUCENE-4113).