NUTCH-230: OPIC score for outlinks should be based on # of valid links, not total # of links.

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • 0.8
    • None
    • None
    • None

    Description

      In ParseOutputFormat.java, the write() method currently divides the page score by the total number of outlinks:

      score /= links.length;

      It then loops over the links, and only those that pass the normalize/filter gauntlet get added to the crawl output.

      But this means that each filtered-out link takes its share of the page's OPIC score with it, so that part of the score is "lost".

      For Nutch 0.7, I first built a list of valid (post-filter) links, used the size of that list to determine the per-link OPIC score, and then iterated over the list, adding entries to the crawl output.
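
      A minimal sketch of that approach, not the attached patch and not actual Nutch code. The class OutlinkScoreSketch, the methods validLinks() and perLinkScore(), and the normalizeAndFilter callback (which returns null for a rejected URL) are all hypothetical names used for illustration. It also uses modern Java rather than what Nutch 0.7 was written in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical illustration only; none of these names come from Nutch.
class OutlinkScoreSketch {

    // Collects the links that survive normalization and filtering.
    // Assumption: the callback returns the normalized URL, or null to reject it.
    static List<String> validLinks(List<String> links,
                                   UnaryOperator<String> normalizeAndFilter) {
        List<String> valid = new ArrayList<>();
        for (String url : links) {
            String kept = normalizeAndFilter.apply(url);
            if (kept != null) {
                valid.add(kept);
            }
        }
        return valid;
    }

    // Splits the page score over the valid links only, so no share is
    // assigned to a link that is later dropped. Returns 0 when nothing is valid.
    static float perLinkScore(float pageScore, int validCount) {
        return validCount == 0 ? 0.0f : pageScore / validCount;
    }

    public static void main(String[] args) {
        List<String> links = List.of(
            "http://a.example/", "mailto:someone@a.example",
            "http://b.example/", "javascript:void(0)");

        // Toy filter (an assumption): keep only http:// URLs.
        List<String> valid = validLinks(links, u -> u.startsWith("http://") ? u : null);
        float score = perLinkScore(1.0f, valid.size());

        // Each surviving link is written with the same per-link score.
        for (String url : valid) {
            System.out.println(url + " " + score);
        }
    }
}
```

      With a page score of 1.0 and two of four links surviving the filter, each valid link receives 0.5. Dividing by links.length would give each 0.25, and half of the page's score would go to links that are never written.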

      Attachments

        1. patch.txt
          2 kB
          Andrzej Bialecki

          People

            Assignee: Unassigned
            Reporter: Kenneth William Krugler (kkrugler)