  1. Nutch
  2. NUTCH-321

Scoring API deficiency

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8
    • Fix Version/s: 0.8
    • Component/s: None
    • Labels: None

    Description

      Currently, the method ScoringFilter.updateDbScore() does not use the "old" value from the existing CrawlDB. Instead, it uses the value taken from the fetchlist in the current segment, which is a snapshot of the "old" value taken at the moment the fetchlist was generated.

      The problem with this approach is that, if and when we add the ability to interleave generate/fetch/update cycles, the initial score in the CrawlDatum instance that comes from the current segment may already be outdated: another updatedb run in the meantime may have changed the DB score.

      For this reason we should always assume that the value from CrawlDB, if it exists, represents the most recent version of the CrawlDatum before the update, and use that instance as the base.
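
To make the proposal concrete, here is a minimal sketch of the base-selection rule described above. It is not the attached patch and does not use the real Nutch classes: Datum is a hypothetical stand-in for CrawlDatum, the names chooseBase and updateDbScore are illustrative only, and the additive update rule is an assumption made just to show which starting value gets used.

```java
public class ScoreBaseSketch {

    /** Hypothetical stand-in for a CrawlDatum; it carries only a score. */
    static final class Datum {
        final float score;

        Datum(float score) {
            this.score = score;
        }
    }

    /**
     * Prefer the entry from the CrawlDB, since it is the most recent version.
     * Fall back to the fetchlist snapshot only when the URL is not in the DB yet.
     */
    static Datum chooseBase(Datum fromCrawlDb, Datum fromFetchlist) {
        return fromCrawlDb != null ? fromCrawlDb : fromFetchlist;
    }

    /**
     * Illustrative update (assumed, not Nutch's actual formula):
     * start from the chosen base score and add the inlink contributions.
     */
    static float updateDbScore(Datum fromCrawlDb, Datum fromFetchlist, float[] inlinkScores) {
        Datum base = chooseBase(fromCrawlDb, fromFetchlist);
        float score = (base != null) ? base.score : 0.0f;
        for (float contribution : inlinkScores) {
            score += contribution;
        }
        return score;
    }

    public static void main(String[] args) {
        // Snapshot taken when the fetchlist was generated.
        Datum snapshot = new Datum(1.0f);
        // Value written by another updatedb run in the meantime.
        Datum current = new Datum(2.5f);

        // The update starts from the current DB value, not the stale snapshot.
        System.out.println(updateDbScore(current, snapshot, new float[] {0.5f}));
    }
}
```

With interleaved cycles, starting from the snapshot would silently discard whatever the intervening updatedb wrote; starting from the CrawlDB entry preserves it.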

      Attachments

        1. patch.txt
          4 kB
          Andrzej Bialecki

        Activity

          People

            Assignee: Unassigned
            Reporter: Andrzej Bialecki (ab)
            Votes: 0
            Watchers: 0

            Dates

              Created:
              Updated:
              Resolved: