Nutch / NUTCH-134

Summarizer doesn't select the best snippets


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7, 0.7.1, 0.7.2, 0.8
    • Fix Version/s: 0.8

    Description

      Summarizer.java tries to select the best fragments of the input text, i.e. those where the frequency of query terms is highest. However, the logic at line 223 is flawed: excerptSet.add() adds a new excerpt only if an equal one is not already present, and equality is decided by a Comparator that compares only numUniqueTokens. As a result, when two or more excerpts score equally high, only the first is retained; the rest are discarded, and other (possibly lower-scoring) excerpts are selected in their place.
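The behaviour described above can be reproduced in isolation. The following is only a minimal sketch, not the actual Summarizer code: it assumes the set is a sorted set such as java.util.TreeSet, and the Excerpt class, the collect() helper and the class name ExcerptSetDemo are hypothetical stand-ins. Only numUniqueTokens comes from the report.

```java
import java.util.Comparator;
import java.util.Set;
import java.util.TreeSet;

public class ExcerptSetDemo {
    // Hypothetical stand-in for Summarizer's Excerpt; only the field
    // named in the report (numUniqueTokens) is modeled.
    static final class Excerpt {
        final String text;
        final int numUniqueTokens;

        Excerpt(String text, int numUniqueTokens) {
            this.text = text;
            this.numUniqueTokens = numUniqueTokens;
        }
    }

    // Collects excerpts into a sorted set whose Comparator looks only at
    // numUniqueTokens. A TreeSet treats elements that compare as 0 as
    // duplicates, so add() silently rejects equally scoring excerpts.
    static Set<Excerpt> collect(Excerpt... excerpts) {
        Set<Excerpt> set = new TreeSet<>(
            Comparator.comparingInt((Excerpt e) -> e.numUniqueTokens));
        for (Excerpt e : excerpts) {
            set.add(e);
        }
        return set;
    }

    public static void main(String[] args) {
        Set<Excerpt> kept = collect(
            new Excerpt("first", 3),
            new Excerpt("second", 3),  // same score as "first": dropped
            new Excerpt("third", 1));
        for (Excerpt e : kept) {
            System.out.println(e.text + " " + e.numUniqueTokens);
        }
    }
}
```

Three excerpts go in, but only two survive: the second excerpt with score 3 is rejected, while the lower-scoring one is kept.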

      To fix this, the Set should be replaced with a List plus a sort operation. To preserve the relative position of excerpts in the original text, the Excerpt class should be extended with an "int order" field, and the selected excerpts should be sorted by that field before they are added to the summary.
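A rough sketch of the proposed approach follows. It is an illustration under assumptions, not the content of the attached patch: the class name ExcerptListDemo, the selectBest() helper and its max parameter are hypothetical, and only numUniqueTokens and the "int order" field come from the description.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ExcerptListDemo {
    // Hypothetical Excerpt with the proposed "int order" field added.
    static final class Excerpt {
        final String text;
        final int numUniqueTokens;
        final int order;  // position of the excerpt in the original text

        Excerpt(String text, int numUniqueTokens, int order) {
            this.text = text;
            this.numUniqueTokens = numUniqueTokens;
            this.order = order;
        }
    }

    // Picks up to max excerpts with the highest scores, then restores
    // their original relative order. A List keeps equally scoring
    // excerpts, unlike a Set keyed on the score alone.
    static List<Excerpt> selectBest(List<Excerpt> all, int max) {
        List<Excerpt> byScore = new ArrayList<>(all);
        // List.sort is stable, so ties stay in document order.
        byScore.sort(Comparator
            .comparingInt((Excerpt e) -> e.numUniqueTokens)
            .reversed());
        List<Excerpt> best = new ArrayList<>(
            byScore.subList(0, Math.min(max, byScore.size())));
        best.sort(Comparator.comparingInt((Excerpt e) -> e.order));
        return best;
    }

    public static void main(String[] args) {
        List<Excerpt> all = new ArrayList<>();
        all.add(new Excerpt("a", 2, 0));
        all.add(new Excerpt("b", 1, 1));
        all.add(new Excerpt("c", 3, 2));
        all.add(new Excerpt("d", 3, 3));
        for (Excerpt e : selectBest(all, 3)) {
            System.out.println(e.text);
        }
    }
}
```

With max set to 3, both excerpts scoring 3 are kept along with the one scoring 2, and they come out in document order: a, c, d.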

      Attachments

        1. summarizer.060506.patch
          40 kB
          Jerome Charron

            People

              Assignee: Jerome Charron (jerome.charron)
              Reporter: Andrzej Bialecki (ab)
              Votes: 6
              Watchers: 3
