Maybe I am wrong, but handling redirects can be a very complex topic, and I am not sure a general solution can easily be found.
Right now I am facing the following issue: we have a legacy document repository on the corporate intranet (accessed via HTTP). People have made a lot of links to it over the years but never updated the old HTML files containing those links, so we now have tons of links to documents that are already gone. When such a document is requested, the repository simply redirects the request to a default page (the main page in this case). For example, URL links like http://some.repo/executive_success.pdf. I am not sure how much work Nutch plugins could do for us here, but it seems to me that redirect handling should be very flexible. Would it help if redirect handling were extracted out of nutch-core into the plugin system?
All of these issues have to do with redirection not updating the original URL in the crawldb.
Another reason why it would be better to wait until the next segment to process the target of the redirect is that this target may already have been fetched. In this case, there's no need to refetch it. More importantly, though, refetching the page will cause its OPIC score to be distributed a second time to its outlinks. In fact, each page that redirects to the target page will cause the target page's OPIC score to get redistributed.
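To make the score duplication concrete, here is a toy sketch of OPIC-style distribution. It assumes a simplified model in which a fetched page splits its score equally among its outlinks; the class name, method names, and example.com URLs are hypothetical, and this is not Nutch's actual scoring code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of OPIC-style score distribution; not Nutch's actual scoring code.
class OpicRefetchSketch {
    // Cash accumulated by each URL from the pages that link to it.
    private final Map<String, Double> cash = new HashMap<>();

    // Each time a page is fetched, its score is split equally among its outlinks.
    void fetch(double pageScore, List<String> outlinks) {
        if (outlinks.isEmpty()) {
            return;
        }
        double share = pageScore / outlinks.size();
        for (String link : outlinks) {
            cash.merge(link, share, Double::sum);
        }
    }

    double cashOf(String url) {
        return cash.getOrDefault(url, 0.0);
    }

    // Fetches the same target page `fetches` times and returns what one outlink received.
    static double cashAfterFetches(int fetches) {
        OpicRefetchSketch model = new OpicRefetchSketch();
        List<String> outlinks = List.of("http://example.com/a", "http://example.com/b");
        for (int i = 0; i < fetches; i++) {
            model.fetch(1.0, outlinks);
        }
        return model.cashOf("http://example.com/a");
    }

    public static void main(String[] args) {
        System.out.println("fetched once:  " + cashAfterFetches(1));
        System.out.println("fetched twice: " + cashAfterFetches(2));
    }
}
```

In this model, every additional fetch of the target (for example, one per redirecting page) adds another full share to each outlink.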
I honestly can't see a good reason for following a redirect immediately, since hopefully these cases aren't common enough to make a significant difference to crawling performance. Note that there are several other issues related to this one, so we should take care that any fix satisfies the goals of all of them. In particular, I agree that we should be saving more information in the metadata about the redirection (as well as other protocol cases).
+1 for not following redirects immediately: it simplifies the fetcher logic.
I would also like to see a flexible (configurable?) solution rather than a one-size-fits-all one, because there are conflicting requirements (or at least opinions) around this topic.
As a consequence of this issue, a crawl could be permanently blocked.
Imagine the top 5 million entries of the crawldb are all redirect URLs whose targets have already been fetched. Then you can successfully generate 5 million, fetch 5 million and parse 5 million, but after an update of the crawldb, nothing has changed! I agree this is a serious problem for any production use of Nutch, and a blocker, since you end up refetching the same pages again and again.
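The blocked-crawl scenario can be sketched with a toy generate/fetch/update loop. This is only an illustration under stated assumptions: the crawldb is a plain ordered map, redirecting URLs sort first, and the `Status` enum, the `run` method and the example.com URLs are all hypothetical rather than Nutch APIs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the generate/fetch/update cycle; all names are hypothetical, not Nutch APIs.
class BlockedCrawlSketch {
    enum Status { UNFETCHED, FETCHED, REDIRECTED }

    // Builds a crawldb with `redirects` redirecting URLs followed by `normal` ordinary URLs,
    // runs `cycles` rounds generating `topN` URLs each, and returns how many ordinary
    // URLs ended up fetched.
    static int run(int redirects, int normal, int topN, int cycles, boolean updateRedirectStatus) {
        Map<String, Status> crawlDb = new LinkedHashMap<>();
        for (int i = 0; i < redirects; i++) {
            crawlDb.put("http://example.com/old" + i, Status.UNFETCHED);
        }
        for (int i = 0; i < normal; i++) {
            crawlDb.put("http://example.com/page" + i, Status.UNFETCHED);
        }

        for (int c = 0; c < cycles; c++) {
            // generate: take the first topN unfetched entries (redirects come first here)
            List<String> segment = new ArrayList<>();
            for (Map.Entry<String, Status> e : crawlDb.entrySet()) {
                if (segment.size() == topN) {
                    break;
                }
                if (e.getValue() == Status.UNFETCHED) {
                    segment.add(e.getKey());
                }
            }
            // fetch + update
            for (String url : segment) {
                boolean isRedirect = url.contains("/old");
                if (!isRedirect) {
                    crawlDb.put(url, Status.FETCHED);
                } else if (updateRedirectStatus) {
                    crawlDb.put(url, Status.REDIRECTED);
                }
                // otherwise the redirecting URL stays UNFETCHED and is generated again
            }
        }

        int fetched = 0;
        for (Status s : crawlDb.values()) {
            if (s == Status.FETCHED) {
                fetched++;
            }
        }
        return fetched;
    }

    public static void main(String[] args) {
        System.out.println("status not updated: " + run(5, 5, 5, 3, false) + " pages fetched");
        System.out.println("status updated:     " + run(5, 5, 5, 3, true) + " pages fetched");
    }
}
```

Without the status update, every cycle regenerates the same redirecting URLs and the ordinary pages are never reached; with it, the second cycle moves on to them.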
Let's not overcomplicate this issue. At the moment, two different problems of different priorities are mixed in one issue.
Problem 1, blocker: The status of the URL causing the redirect isn't updated. Fixing that is not hard; attached is a one-liner patch. Hopefully this can be applied soon.
Problem 2, minor: Should redirects be fetched immediately or not? One argument for fetching them immediately is that otherwise the redirectCount would have to be moved into the CrawlDatum (metadata). If it's possible (in Jira), I suggest this problem be split into a different issue.
Fixed in trunk/, rev. 490607.
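On Problem 2: if redirect targets were deferred to a later segment, the redirect count would have to travel with the URL so that long redirect chains can still be cut off. A minimal sketch of that idea follows; the `CrawlEntry` class, the metadata key and the limit of 3 are assumptions made for illustration and do not reflect Nutch's actual CrawlDatum API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Sketch of carrying a redirect count in per-URL metadata; names are hypothetical.
class DeferredRedirectSketch {
    static final String REDIRECT_COUNT_KEY = "redirectCount"; // hypothetical metadata key
    static final int MAX_REDIRECTS = 3;                       // hypothetical limit

    // Minimal stand-in for a crawldb entry with a metadata map.
    static final class CrawlEntry {
        final String url;
        final Map<String, String> metadata = new HashMap<>();

        CrawlEntry(String url) {
            this.url = url;
        }
    }

    // Called when fetching `source` returned a redirect to `targetUrl`.
    // Returns the entry to queue for a later segment, or empty if the chain is too long.
    static Optional<CrawlEntry> deferRedirect(CrawlEntry source, String targetUrl) {
        int count = Integer.parseInt(source.metadata.getOrDefault(REDIRECT_COUNT_KEY, "0")) + 1;
        if (count > MAX_REDIRECTS) {
            return Optional.empty();
        }
        CrawlEntry target = new CrawlEntry(targetUrl);
        target.metadata.put(REDIRECT_COUNT_KEY, Integer.toString(count));
        return Optional.of(target);
    }

    // Follows a chain of `hops` redirects and returns how many were accepted.
    static int acceptedHops(int hops) {
        CrawlEntry current = new CrawlEntry("http://example.com/hop0");
        int accepted = 0;
        for (int i = 1; i <= hops; i++) {
            Optional<CrawlEntry> next = deferRedirect(current, "http://example.com/hop" + i);
            if (!next.isPresent()) {
                break;
            }
            current = next.get();
            accepted++;
        }
        return accepted;
    }

    public static void main(String[] args) {
        System.out.println("accepted hops in a chain of 10: " + acceptedHops(10));
    }
}
```

Because the count lives in the entry's metadata rather than in fetcher state, it survives between segments.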
It would be nice to still associate the original URL with the content of the redirect URL when indexing. Perhaps a list of URLs that redirected to each page could be kept in the CrawlDatum metadata? Can anyone think of a better way to implement this?
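One way to picture the list-of-redirecting-URLs idea is a map from each target URL to the URLs that redirected to it. The sketch below keeps that map in memory purely for illustration; how it would be stored in the CrawlDatum metadata is left open, and the class, method names and example.com URLs are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of remembering which URLs redirected to a page; names are hypothetical.
class RedirectSourcesSketch {
    // target URL -> URLs that redirected to it, in the order first seen
    private final Map<String, List<String>> sources = new HashMap<>();

    // Records that `from` redirected to `to`, ignoring repeats of the same pair.
    void recordRedirect(String from, String to) {
        List<String> list = sources.computeIfAbsent(to, k -> new ArrayList<>());
        if (!list.contains(from)) {
            list.add(from);
        }
    }

    // Returns a copy of the URLs known to redirect to `target` (empty if none).
    List<String> redirectedFrom(String target) {
        return new ArrayList<>(sources.getOrDefault(target, new ArrayList<>()));
    }

    static List<String> demo() {
        RedirectSourcesSketch sketch = new RedirectSourcesSketch();
        String target = "http://example.com/main";
        sketch.recordRedirect("http://example.com/old1", target);
        sketch.recordRedirect("http://example.com/old2", target);
        sketch.recordRedirect("http://example.com/old1", target); // repeat, ignored
        return sketch.redirectedFrom(target);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

At indexing time, such a list could be attached to the target page's document so that the original URLs remain associated with its content.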