
Key: NUTCH-273
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Blocker
Assignee: Andrzej Bialecki
Reporter: Lukas Vlcek
Votes: 5
Watchers: 6
Nutch

When a page is redirected, the original url is NOT updated.

Created: 20/May/06 04:23 PM   Updated: 28/Dec/06 12:18 AM
Component/s: fetcher
Affects Version/s: 0.8
Fix Version/s: 0.9.0

Time Tracking: Not Specified

File Attachments:
Fetcher.java-489586.diff (0.6 kB), attached 2006-12-22 09:38 AM by Eelco Lempsink; licensed for inclusion in ASF works
Environment: n/a

Resolution Date: 28/Dec/06 12:18 AM


Description
[Excerpt from maillist, sender: Andrzej Bialecki]
When a page is redirected, the original URL is NOT updated, so the CrawlDB will never know that a redirect occurred; it won't even know that a fetch occurred... This looks like a bug.
In 0.7 this was recorded in the segment, and it would then affect the Page status during updatedb. It should do so in 0.8, too...

Comments
Doug Cutting added a comment - 27/May/06 03:27 AM
Redirects should really not be followed immediately anyway. We should instead note that it was redirected and to which URL in the fetcher output. Then, when the crawl db is updated with the fetcher output, the target of the redirect should be added, with the full OPIC score of the original URL. This will enable proper politeness guarantees.

It would be nice to still associate the original URL with the content of the redirect URL when indexing. Perhaps a list of URLs that redirected to each page could be kept in the CrawlDatum metadata? Can anyone think of a better way to implement this?
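
The deferred handling proposed above could be sketched roughly as follows. This is a toy model, not Nutch code: `DbEntry`, `FetchResult`, `updateDb` and the status strings are hypothetical names invented for illustration, and the real CrawlDatum and updatedb classes are not used. The sketch also assumes that a redirect target already present in the db keeps its existing score and status, a case the comment does not address.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of deferred redirect handling; these are not Nutch's real classes. */
class DeferredRedirectSketch {

    /** Hypothetical stand-in for a crawl db entry. */
    static class DbEntry {
        String status = "unfetched";
        double score;
        // URLs that redirected to this entry, kept as metadata for indexing.
        List<String> redirectedFrom = new ArrayList<>();
        DbEntry(double score) { this.score = score; }
    }

    /** Hypothetical fetcher output record for one URL. */
    static class FetchResult {
        final String url;
        final String redirectTarget; // null when the fetch was not redirected
        FetchResult(String url, String redirectTarget) {
            this.url = url;
            this.redirectTarget = redirectTarget;
        }
    }

    /** Merge fetcher output into the crawl db without following redirects. */
    static void updateDb(Map<String, DbEntry> db, List<FetchResult> results) {
        for (FetchResult r : results) {
            DbEntry source = db.get(r.url);
            if (source == null) continue;
            if (r.redirectTarget == null) {
                source.status = "fetched";
                continue;
            }
            // Record the redirect on the original URL so it is not selected again.
            source.status = "redirected";
            DbEntry target = db.get(r.redirectTarget);
            if (target == null) {
                // New target: left unfetched for a later segment, with the source's full score.
                target = new DbEntry(source.score);
                db.put(r.redirectTarget, target);
            }
            // Assumption: an existing target keeps its own score and status.
            target.redirectedFrom.add(r.url);
        }
    }

    public static void main(String[] args) {
        Map<String, DbEntry> db = new HashMap<>();
        db.put("http://example.com/old", new DbEntry(2.0));
        List<FetchResult> output = new ArrayList<>();
        output.add(new FetchResult("http://example.com/old", "http://example.com/new"));
        updateDb(db, output);
        DbEntry target = db.get("http://example.com/new");
        System.out.println(target.status + " " + target.score + " " + target.redirectedFrom);
    }
}
```

Because the target is only queued, it goes through the normal generate and fetch cycle later, which is where the politeness guarantees would apply.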


Lukas Vlcek added a comment - 28/May/06 03:37 AM
Maybe I am wrong, but handling redirects can be a very complex topic, and I am not sure a general solution can easily be found.

Right now I am facing the following issue: we have a legacy document repository on our corporate intranet (accessed via HTTP). Over the years people created a lot of links to it but never updated the old HTML files containing those links, so we now have tons of links to documents that are already gone. If such a document is requested, the document repository simply redirects the request to a default page (the main page in this case).

For example, the URLs http://some.repo/executive_success.pdf and http://some.repo/individual_failure.doc are both redirected to the same default main page with unrelated content (it could be a contact list, for example). Does that mean that executive_success and individual_failure are both related to the contact list?

I am not sure how much work Nutch plugins could do for us here, but it seems to me that redirect handling should be very flexible. Would it help if redirect handling were extracted out of nutch-core into the plugin system?


Chris Schneider added a comment - 23/Aug/06 10:04 PM
All of these issues have to do with redirection not updating the original URL in the crawldb

Chris Schneider added a comment - 23/Aug/06 10:18 PM
Another reason why it would be better to wait until the next segment to process the target of the redirect is that this target may already have been fetched. In this case, there's no need to refetch it. More importantly, though, refetching the page will cause its OPIC score to be distributed a second time to its outlinks. In fact, each page that redirects to the target page will cause the target page's OPIC score to get redistributed.

I honestly can't see a good reason for doing an immediate redirect, since hopefully these cases aren't common enough to make a significant difference to crawling performance.

Note that there are several other issues related to this issue, so we should take care to satisfy the goals of all with any fix. In particular, I agree that we should be saving more information in the metadata about the redirection (as well as other protocol cases).
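
The double-distribution concern raised in this comment can be illustrated with a deliberately simplified model. It assumes that every fetch of a page hands an even share of the page's score to each outlink; this is a simplification for illustration, not the actual OPIC scoring code in Nutch, and `distribute` and `simulate` are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of score being handed to outlinks once per fetch of a page. */
class ScoreRedistributionSketch {

    /** Split a page's score evenly among its outlinks (simplified, OPIC-like step). */
    static void distribute(double pageScore, List<String> outlinks, Map<String, Double> received) {
        if (outlinks.isEmpty()) return;
        double share = pageScore / outlinks.size();
        for (String link : outlinks) {
            received.merge(link, share, Double::sum);
        }
    }

    /** Total score each outlink receives when the target page is fetched `fetches` times. */
    static Map<String, Double> simulate(double targetScore, List<String> outlinks, int fetches) {
        Map<String, Double> received = new HashMap<>();
        for (int i = 0; i < fetches; i++) {
            distribute(targetScore, outlinks, received);
        }
        return received;
    }

    public static void main(String[] args) {
        List<String> outlinks = Arrays.asList("http://example.com/1", "http://example.com/2");
        // One direct fetch plus two immediately followed redirects to the same target.
        System.out.println(simulate(1.0, outlinks, 3));
        // Deferred handling: the target is fetched only once.
        System.out.println(simulate(1.0, outlinks, 1));
    }
}
```

In this model each additional redirect that is followed immediately adds another full round of distribution, which is the inflation a deferred approach would avoid.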


Sami Siren added a comment - 07/Sep/06 06:13 PM
+1 for not following redirects immediately - simplify fetcher logic.

I would also like to see a flexible (configurable?) solution rather than one size fits all, because there are conflicting requirements (or at least opinions) around this topic.


Johannes Zillmann added a comment - 19/Nov/06 04:13 PM
As a consequence of this issue, a crawl could be permanently blocked.
Imagine the top 5 million entries of the crawldb are all redirect URLs whose targets have already been fetched.
Then you can successfully generate 5 million, fetch 5 million and parse 5 million, but after an update of the crawldb nothing has changed!
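
The stall described above can be reproduced in a small simulation. This is a sketch under stated assumptions: `UrlState`, `generate` and `update` are hypothetical stand-ins for the crawldb entry, the generate step and the updatedb step, and the `updateRedirectSource` flag models whether the status of a redirecting URL is written back.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a crawl cycle that stalls when redirect sources are never updated. */
class StalledCrawlSketch {

    /** Hypothetical crawl db entry. */
    static class UrlState {
        final double score;
        final boolean redirects; // true if fetching this URL yields a redirect
        boolean done = false;    // true once the db knows the URL was handled
        UrlState(double score, boolean redirects) {
            this.score = score;
            this.redirects = redirects;
        }
    }

    /** Select up to topN highest-scoring URLs that are not yet marked done. */
    static List<String> generate(Map<String, UrlState> db, int topN) {
        List<String> urls = new ArrayList<>();
        for (Map.Entry<String, UrlState> e : db.entrySet()) {
            if (!e.getValue().done) urls.add(e.getKey());
        }
        urls.sort(Comparator.comparingDouble((String u) -> db.get(u).score).reversed());
        return new ArrayList<>(urls.subList(0, Math.min(topN, urls.size())));
    }

    /** Apply one fetch round; redirect sources are marked only if updateRedirectSource is true. */
    static void update(Map<String, UrlState> db, List<String> fetched, boolean updateRedirectSource) {
        for (String url : fetched) {
            UrlState state = db.get(url);
            if (!state.redirects || updateRedirectSource) {
                state.done = true;
            }
        }
    }

    public static void main(String[] args) {
        Map<String, UrlState> db = new LinkedHashMap<>();
        db.put("redirect-1", new UrlState(3.0, true));
        db.put("redirect-2", new UrlState(2.0, true));
        db.put("page-1", new UrlState(1.0, false));
        List<String> round1 = generate(db, 2);
        update(db, round1, false);                 // buggy behaviour: nothing is recorded
        System.out.println(generate(db, 2).equals(round1)); // same list selected again
    }
}
```

With `updateRedirectSource` set to `false`, every round selects the same redirecting URLs; set to `true`, the next round moves on to the remaining entries.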

Stefan Groschupf added a comment - 25/Nov/06 10:39 AM
I agree this is a serious problem for any production use of Nutch. It is a blocker, since you end up refetching the same pages again and again.

Eelco Lempsink added a comment - 22/Dec/06 09:38 AM
Let's not overcomplicate this issue. At the moment, two different problems of different priorities are mixed in one issue.

Problem 1, blocker: The status of the URL causing the redirect isn't updated. Fixing that is not hard, attached is a one-liner patch. Hopefully this can be applied soon.

Problem 2, minor: Should redirects be fetched immediately or not? One argument for fetching them immediately is that otherwise the redirectCount would have to be moved into the CrawlDatum (metadata). If it's possible (in Jira), I suggest splitting this problem into a separate issue.


Andrzej Bialecki added a comment - 28/Dec/06 12:18 AM
Fixed in trunk/, rev. 490607.