  1. Lucene - Core
  2. LUCENE-4133

FastVectorHighlighter: A weighted approach for ordered fragments

    Details

    • Lucene Fields:
      New, Patch Available

      Description

      The FastVectorHighlighter currently disregards IDF weights for matching terms within generated fragments. In the worst case, a fragment that contains a high number of very common words is scored higher than a fragment that contains all of the terms used in the original query.

      This patch provides ordered fragments with IDF-weighted terms:

      For each distinct matching term per fragment:
      weight = weight + IDF * boost

      For each fragment:
      weight = weight * length * 1 / sqrt( length )

      weight: total weight of the fragment
      IDF: inverse document frequency of each distinct matching term
      boost: query boost as provided, for example term^2
      length: total number of non-distinct matching terms per fragment
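
Read together, the two steps above collapse into a single expression, because length * 1 / sqrt( length ) equals sqrt( length ). A sketch of that closed form follows. The symbols w_f (fragment weight), T_f (the set of distinct matching terms in fragment f) and n_f (the length as defined above) are notation introduced here for illustration and do not appear in the patch; boost(t) is assumed to be the boost of the query phrase that term t belongs to.

```latex
w_f \;=\; \sqrt{n_f}\,\sum_{t \in T_f} \mathrm{IDF}(t)\cdot\mathrm{boost}(t),
\qquad\text{since}\qquad
n_f \cdot \frac{1}{\sqrt{n_f}} \;=\; \sqrt{n_f}.
```

Under this reading, repeated matches only raise the weight through the square-root factor, while each distinct term contributes its IDF once.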

      Method:

        public void add( int startOffset, int endOffset, List<WeightedPhraseInfo> phraseInfoList ) {

          float totalBoost = 0;

          List<SubInfo> subInfos = new ArrayList<SubInfo>();
          // terms already counted towards the weight of this fragment
          HashSet<String> distinctTerms = new HashSet<String>();

          // number of matching terms in this fragment, repeats included
          int length = 0;

          for( WeightedPhraseInfo phraseInfo : phraseInfoList ){
            subInfos.add( new SubInfo( phraseInfo.getText(), phraseInfo.getTermsOffsets(), phraseInfo.getSeqnum() ) );
            for ( TermInfo ti :  phraseInfo.getTermsInfos()) {
              // add the term weight (IDF) times the query boost once per distinct term
              if ( distinctTerms.add( ti.getText() ) )
                totalBoost += ti.getWeight() * phraseInfo.getBoost();
              length++;
            }
          }
          // length * ( 1 / sqrt( length ) ) is equivalent to sqrt( length )
          totalBoost *= length * ( 1 / Math.sqrt( length ) );

          getFragInfos().add( new WeightedFragInfo( startOffset, endOffset, subInfos, totalBoost ) );
        }
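
To make the effect of the weighting concrete, here is a minimal, self-contained sketch that mirrors the loop above without depending on Lucene. The class FragmentWeightSketch, the nested TermMatch type, and all IDF values are hypothetical and chosen only for illustration; the early return for an empty fragment is an assumption of this sketch and is not part of the method shown above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class FragmentWeightSketch {

  /** Hypothetical stand-in for one matching term occurrence inside a fragment. */
  static final class TermMatch {
    final String text;
    final float idf;    // illustrative IDF weight of the term
    final float boost;  // query boost, e.g. 2 for term^2

    TermMatch(String text, float idf, float boost) {
      this.text = text;
      this.idf = idf;
      this.boost = boost;
    }
  }

  /** Sums IDF * boost once per distinct term, then scales by sqrt(number of matches). */
  static float fragmentWeight(List<TermMatch> matches) {
    if (matches.isEmpty()) {
      return 0f; // assumption of this sketch: an empty fragment weighs nothing
    }
    Set<String> distinctTerms = new HashSet<>();
    float weight = 0f;
    for (TermMatch m : matches) {
      if (distinctTerms.add(m.text)) {
        weight += m.idf * m.boost;
      }
    }
    // length * (1 / sqrt(length)) simplifies to sqrt(length)
    return (float) (weight * Math.sqrt(matches.size()));
  }

  public static void main(String[] args) {
    // Fragment A: one very common term matched four times.
    List<TermMatch> common = Arrays.asList(
        new TermMatch("the", 0.5f, 1f), new TermMatch("the", 0.5f, 1f),
        new TermMatch("the", 0.5f, 1f), new TermMatch("the", 0.5f, 1f));
    // Fragment B: two rare query terms matched once each.
    List<TermMatch> rare = Arrays.asList(
        new TermMatch("lucene", 3.0f, 1f), new TermMatch("highlighter", 4.0f, 1f));

    System.out.println("common-word fragment: " + fragmentWeight(common));
    System.out.println("rare-term fragment:   " + fragmentWeight(rare));
  }
}
```

With these made-up values, the fragment holding two rare terms outweighs the fragment holding four occurrences of a common term, even though the latter has more matches.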
      

      The ranking formula should be the same as, or at least similar to, the one used in QueryTermScorer.

      This patch contains:

      • a renamed class member in FieldPhraseList (termInfos to termsInfos)
      • a renamed local variable in SimpleFieldFragList (score to totalBoost)
      • a missing @Override annotation added in SimpleFragListBuilder
      • class WeightedFieldFragList, an implementation of FieldFragList
      • class WeightedFragListBuilder, an implementation of BaseFragListBuilder
      • class WeightedFragListBuilderTest, a simple test case
      • updated docs for FVH

      This is the last part of LUCENE-3440 (see also LUCENE-4091, LUCENE-4107, LUCENE-4113).

        Attachments

        1. LUCENE-4133.patch
          14 kB
          Koji Sekiguchi
        2. LUCENE-4133.patch
          14 kB
          Sebastian Lutze

            People

            • Assignee:
              koji Koji Sekiguchi
              Reporter:
              mdz-munich Sebastian Lutze
