Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8
    • Fix Version/s: 0.8
    • Component/s: None
    • Labels:
      None

      Description

      Currently, ScoringFilter.updateDbScore() does not use the "old" value from the existing CrawlDB. Instead, it uses the value from the current segment's fetchlist, which is a snapshot of the "old" value taken when the fetchlist was generated.

      The problem with this approach is that once we add the ability to interleave generate/fetch/update cycles, the initial score in the CrawlDatum instance that comes from the current segment may already be outdated: another updatedb run in the meantime may have changed the DB score.

      For this reason, we should always assume that the value from CrawlDB, if it exists, is the most recent version of the CrawlDatum before the update, and use that instance as the base.
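The stale-snapshot problem described above can be illustrated with a small self-contained sketch. It does not use the Nutch classes: `Datum` is a hypothetical stand-in for CrawlDatum, the method names are invented for this example, and the additive score update is an assumption chosen only to keep the arithmetic obvious.

```java
// Illustrative sketch only; Datum is a hypothetical stand-in for CrawlDatum.
final class StaleScoreDemo {

    /** Minimal stand-in for a CrawlDB entry: only the score matters here. */
    static final class Datum {
        final float score;
        Datum(float score) { this.score = score; }
    }

    /** Current behaviour: start from the fetchlist snapshot stored in the segment. */
    static float updateFromSnapshot(Datum segmentSnapshot, float contribution) {
        return segmentSnapshot.score + contribution;
    }

    /** Proposed behaviour: start from the CrawlDB entry when it exists. */
    static float updateFromDb(Datum dbEntry, Datum segmentSnapshot, float contribution) {
        Datum base = (dbEntry != null) ? dbEntry : segmentSnapshot;
        return base.score + contribution;
    }

    public static void main(String[] args) {
        Datum snapshot = new Datum(1.0f); // score when the fetchlist was generated
        Datum dbEntry = new Datum(1.5f);  // score after another updatedb ran in the meantime
        float contribution = 0.25f;       // assumed additive contribution from this cycle

        System.out.println(updateFromSnapshot(snapshot, contribution));    // prints 1.25
        System.out.println(updateFromDb(dbEntry, snapshot, contribution)); // prints 1.75
    }
}
```

With the snapshot as the base, the 0.5 added by the interleaved updatedb is silently lost; with the CrawlDB entry as the base, it is preserved.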

      1. patch.txt
        4 kB
        Andrzej Bialecki

        Activity

        Andrzej Bialecki added a comment -

        Proposed improvements. If there are no objections I'll commit them shortly.

        NOTE: this changes the API, but since v. 0.8 is still unreleased I feel it's the right time to do it.

        Andrzej Bialecki added a comment -

        Patch applied to trunk/.

        NOTE: this requires a (trivial) change in any custom scoring plugin. Most likely, to accommodate future support for interleaved fetching cycles, you should use the "old" CrawlDatum as the basis for the initial score to be updated, instead of the "datum" (which is a snapshot of the value at the time of generating the fetchlist).
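As a rough sketch of what that plugin-side change might look like, the example below picks the "old" entry as the base and falls back to "datum" when the URL is not yet in the DB. It is self-contained and does not compile against Nutch: `Datum` is a hypothetical stand-in for CrawlDatum, the parameter list is an assumption modeled on the "old" and "datum" names used in this comment, and summing the inlinked scores is an invented policy, not the behaviour of any shipped plugin.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not the actual ScoringFilter API.
final class ScoreUpdateSketch {

    /** Hypothetical stand-in for CrawlDatum with a mutable score. */
    static final class Datum {
        private float score;
        Datum(float score) { this.score = score; }
        float getScore() { return score; }
        void setScore(float score) { this.score = score; }
    }

    /**
     * Writes the updated score into datum.
     * old   - entry currently in the DB, or null for a newly discovered URL (assumption)
     * datum - entry from the segment, i.e. the snapshot taken at generate time
     */
    static void updateDbScore(String url, Datum old, Datum datum, List<Datum> inlinked) {
        // Prefer the DB entry as the base; fall back to the snapshot if there is none.
        Datum base = (old != null) ? old : datum;
        float adjust = 0.0f;
        for (Datum linked : inlinked) {
            adjust += linked.getScore(); // assumed policy: sum of inlink scores
        }
        datum.setScore(base.getScore() + adjust);
    }

    public static void main(String[] args) {
        Datum old = new Datum(2.0f);   // current DB score
        Datum datum = new Datum(1.0f); // stale snapshot from the segment
        List<Datum> inlinked = Arrays.asList(new Datum(0.5f), new Datum(0.25f));
        updateDbScore("http://example.com/", old, datum, inlinked);
        System.out.println(datum.getScore()); // prints 2.75
    }
}
```

The null check matters for URLs seen for the first time, where no DB entry exists yet and the segment value is the only available starting point.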

          People

          • Assignee: Unassigned
          • Reporter: Andrzej Bialecki
          • Votes: 0
          • Watchers: 0

            Dates

            • Created:
              Updated:
              Resolved:
