|
[
Permlink
| « Hide
]
Doug Cutting added a comment - 09/May/06 09:37 AM
The OPIC score is much like a count of incoming links, but a bit more refined. OPIC(P) is one plus the sum of the OPIC contributions for all links to a page. The OPIC contribution of a link from page P is OPIC(P) / numOutLinks(P).
I would argue that what Nutch implements now shouldn't be called OPIC, because it has little to do with the algorithm described in the OPIC paper. Either we fix it, or we should rename it. Let me explain:
I'm going to commit the scoring API soon, this should make it easier to experiment with different scoring models. Andrzej: your analysis is correct, but it mostly only applies when re-crawling. In an initial crawl, where each url is fetched only once, I think we implement the OPIC "Greedy" strategy. The question of what to do when re-crawling has not been adequately answered, but, glancing at the paper, it seems that resetting a urls score to zero each time it is fetched might be the best thing to do, so that it can start accumulating more cash.
When ranking, summing logs is the same as multiplying, no? Hmm, resetting the score to 0 is also dubious - it's as if we didn't want it to be re-crawled if we can't find any inlinks to it... I believe it should be reset to the following value:
newScore = initialScore - sum(distributedScoreM) + sum(incomingScoreN) where initialScore is the score we got from previous iterations (or injectedScore), sum(distributedScoreM) is what we have distributed to M outlinks from that page, and sum(incomingScoreN) is what is contributed by N inlinks. Current formula omits the sum(distributedScoreM); it also doesn't provide any way to "sponsor" pages with no incoming links so that they won't get broke (the concept of "virtual nodes" I mentioned above). Re: summing logs: yes, but then why use "sqrt(opic) * docSimilarity" instead of "log(opic * docSimilarity)"? re: it's as if we didn't want it to be re-crawled if we can't find any inlinks to it
We prioritize crawling based on the number of pages we've crawled that link to it since we've last crawled it. Assuming it had links to it that caused it to be crawled the first time, and that some of those will also be re-crawled, then its score will again increase. But if no one links to it anymore, it will languish, and not be crawled again unless there're no higher-scoring pages. That sounds right to me, and I think it's what's suggested in the OPIC paper (if i skimmed it correctly). Perhaps it should not be reset to zero, but one, since that's where pages start out. re: why use "sqrt(opic) * docSimilarity" instead of "log(opic * docSimilarity)" Wrapping log() around things changes the score value but not the ranking. So the question is really, why use sqrt(opic)*docSimilarity and not just opic*docSimilarity? The answer is simply that I tried a few queries and sqrt seemed to be required for OPIC to not overly dominate scoring. It was a "seat of the pants" calculation, trying to balance the strength of anchor matches, opic scoring and title, url and body matching, etc. One can disable this by changing the score power parameter. |
||||||||||||||||||||||||||||||||||||||||||||||