Details
Improvement
Status: Closed
Minor
Resolution: Fixed
0.8
Description
In ParseOutputFormat.java, the write() method currently divides the page score by the total number of outlinks, before any of them have been filtered:
score /= links.length;
It then loops over the links, and any that pass the normalize/filter gauntlet get added to the crawl output.
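But this means that every filtered link takes its share of the page's OPIC score with it, so that share is "lost". For example, if 2 of a page's 10 outlinks are filtered out, 20% of the page's score is never passed on.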
For Nutch 0.7, I built a list of valid (post-filter) links, and then used that to determine the per-link OPIC score, after which I iterated over the list, adding entries to the crawl output.
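The filter-first approach described above could be sketched roughly as follows. This is a standalone illustration, not the actual ParseOutputFormat code: the class name OutlinkScoreSketch, the methods validLinks() and perLinkScore(), and the use of a UnaryOperator that returns null for rejected URLs are all assumptions made for the example, and it uses modern Java rather than what Nutch 0.7 was written in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

public class OutlinkScoreSketch {

    /**
     * Collects the outlinks that survive normalization and filtering.
     * The (hypothetical) normalizeAndFilter function returns the normalized
     * URL, or null when the link should be dropped.
     */
    public static List<String> validLinks(String[] links,
                                          UnaryOperator<String> normalizeAndFilter) {
        List<String> valid = new ArrayList<>();
        for (String link : links) {
            String url = normalizeAndFilter.apply(link);
            if (url != null) {
                valid.add(url);
            }
        }
        return valid;
    }

    /**
     * Divides the page score by the number of links that passed the filter,
     * so no part of the score is assigned to a dropped link.
     */
    public static float perLinkScore(float pageScore, int validCount) {
        return validCount == 0 ? 0.0f : pageScore / validCount;
    }

    public static void main(String[] args) {
        String[] links = {
            "http://a.example/", "ftp://b.example/",
            "http://c.example/", "mailto:someone@example.org"
        };
        // Stand-in filter for the example: keep only http:// URLs.
        UnaryOperator<String> httpOnly = u -> u.startsWith("http://") ? u : null;

        // First pass: build the post-filter list.
        List<String> valid = validLinks(links, httpOnly);
        // Then compute the per-link share from the surviving count.
        float share = perLinkScore(1.0f, valid.size());

        // Second pass: emit one entry per surviving link.
        for (String url : valid) {
            System.out.println(url + "\t" + share);
        }
    }
}
```

With the original ordering, each of the four links in this example would have been assigned a quarter of the score and the two filtered links would have taken half of it with them; dividing after filtering gives the two surviving links the full score between them.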