
Key: NUTCH-273
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Blocker
Assignee: Andrzej Bialecki
Reporter: Lukas Vlcek
Votes: 5
Watchers: 6

Nutch

When a page is redirected, the original url is NOT updated.

Created: 20/May/06 04:23 PM   Updated: 28/Dec/06 12:18 AM
Component/s: fetcher
Affects Version/s: 0.8
Fix Version/s: 0.9.0

Time Tracking:
Not Specified

File Attachments:
Fetcher.java-489586.diff (0.6 kB), attached 2006-12-22 09:38 AM by Eelco Lempsink; licensed for inclusion in ASF works
Environment: n/a

Resolution Date: 28/Dec/06 12:18 AM


Description
[Excerpt from maillist, sender: Andrzej Bialecki]
When a page is redirected, the original url is NOT updated - so, CrawlDB will never know that a redirect occurred, it won't even know that a fetch occurred... This looks like a bug.
In 0.7 this was recorded in the segment, and then it would affect the Page status during updatedb. It should do so in 0.8, too...

Doug Cutting added a comment - 27/May/06 03:27 AM
Redirects should really not be followed immediately anyway. We should instead note that it was redirected and to which URL in the fetcher output. Then, when the crawl db is updated with the fetcher output, the target of the redirect should be added, with the full OPIC score of the original URL. This will enable proper politeness guarantees.

It would be nice to still associate the original URL with the content of the redirect URL when indexing. Perhaps a list of URLs that redirected to each page could be kept in the CrawlDatum metadata? Can anyone think of a better way to implement this?
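The proposal above can be sketched roughly as follows. This is only an illustration on a toy in-memory map, not Nutch code: the class `RedirectUpdateSketch`, the `Entry` type, the `Status` values and the `redirectedFrom` list are all hypothetical names. It also assumes that a target already present in the db keeps its existing score, which the comment does not specify.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RedirectUpdateSketch {
    // Hypothetical status values; not Nutch's CrawlDatum status codes.
    enum Status { UNFETCHED, FETCHED, REDIRECTED }

    /** Toy crawl db record (hypothetical, stands in for a CrawlDatum plus metadata). */
    static final class Entry {
        Status status = Status.UNFETCHED;
        float score;
        // URLs that redirected to this page, kept so an indexer could associate them later.
        final List<String> redirectedFrom = new ArrayList<>();

        Entry(float score) {
            this.score = score;
        }
    }

    /** Applies one redirect noted in the fetcher output to the toy crawl db. */
    static void applyRedirect(Map<String, Entry> db, String from, String to) {
        Entry source = db.computeIfAbsent(from, k -> new Entry(0f));
        // The original URL is marked, so the db knows that it was fetched and redirected.
        source.status = Status.REDIRECTED;

        Entry target = db.get(to);
        if (target == null) {
            // New target: add it with the full score of the original URL.
            target = new Entry(source.score);
            db.put(to, target);
        }
        // Assumption: an existing target keeps its score and only gains the back-reference.
        target.redirectedFrom.add(from);
    }

    public static void main(String[] args) {
        Map<String, Entry> db = new LinkedHashMap<>();
        db.put("http://example.com/old", new Entry(1.0f));
        applyRedirect(db, "http://example.com/old", "http://example.com/new");
        db.forEach((url, e) ->
                System.out.println(url + " " + e.status + " " + e.score + " " + e.redirectedFrom));
    }
}
```

The target is not fetched inside `applyRedirect`; it simply becomes an unfetched entry that a later generate step can pick up, which is what allows the usual politeness rules to apply to it.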


Lukas Vlcek added a comment - 28/May/06 03:37 AM
Maybe I am wrong, but handling redirects can be a very complex topic, and I am not sure a general solution can be found easily.

Right now I am facing the following issue: we have a legacy document repository on the corporate intranet (accessed via HTTP), and over the years people created a lot of links to it but never updated the old HTML files containing those links... so the result is that we have tons of links to documents that are already gone. If such a document is requested, the document repository simply redirects the request to a default page (the main page in this case).

For example, the URLs http://some.repo/executive_success.pdf and http://some.repo/individual_failure.doc are both redirected to the same default main page with unrelated content (it could be a contact list, for example). Does that mean that executive_success and individual_failure are both related to the contact list?

I am not sure how much work Nutch plugins could do for us here, but to me it seems that redirect handling should be very flexible. Would it help if redirect handling were extracted out of nutch-core into the plugin system?


Chris Schneider made changes - 23/Aug/06 10:04 PM
Field Original Value New Value
Link This issue blocks NUTCH-322 [ NUTCH-322 ]
Chris Schneider added a comment - 23/Aug/06 10:04 PM
All of these issues have to do with redirection not updating the original URL in the crawldb

Chris Schneider made changes - 23/Aug/06 10:04 PM
Link This issue blocks NUTCH-353 [ NUTCH-353 ]
Chris Schneider made changes - 23/Aug/06 10:05 PM
Link This issue blocks NUTCH-353 [ NUTCH-353 ]
Chris Schneider made changes - 23/Aug/06 10:05 PM
Link This issue blocks NUTCH-322 [ NUTCH-322 ]
Chris Schneider made changes - 23/Aug/06 10:08 PM
Link This issue blocks NUTCH-353 [ NUTCH-353 ]
Chris Schneider made changes - 23/Aug/06 10:11 PM
Link This issue is part of NUTCH-322 [ NUTCH-322 ]
Chris Schneider added a comment - 23/Aug/06 10:18 PM
Another reason why it would be better to wait until the next segment to process the target of the redirect is that this target may already have been fetched. In this case, there's no need to refetch it. More importantly, though, refetching the page will cause its OPIC score to be distributed a second time to its outlinks. In fact, each page that redirects to the target page will cause the target page's OPIC score to get redistributed.

I honestly can't see a good reason for doing an immediate redirect, since hopefully these cases aren't common enough to make a significant difference to crawling performance.

Note that there are several other issues related to this issue, so we should take care to satisfy the goals of all with any fix. In particular, I agree that we should be saving more information in the metadata about the redirection (as well as other protocol cases).
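The point about repeated score distribution can be illustrated with some rough arithmetic. The sketch below assumes a deliberately simplified model in which a page's score is split equally among its outlinks each time the page is fetched; it is not Nutch's OPIC implementation, and `OpicRedistributionSketch` and `outlinkShare` are hypothetical names.

```java
public class OpicRedistributionSketch {
    /**
     * Score that each outlink accumulates when the page's score is
     * distributed once per fetch (simplified model, equal split).
     */
    static double outlinkShare(double pageScore, int outlinks, int fetches) {
        return fetches * (pageScore / outlinks);
    }

    public static void main(String[] args) {
        double score = 1.0;
        int outlinks = 4;
        // Target fetched once, redirects deferred to the next segment.
        System.out.println(outlinkShare(score, outlinks, 1));
        // Target fetched once directly plus once for each of two pages redirecting to it.
        System.out.println(outlinkShare(score, outlinks, 3));
    }
}
```

Under this model, every page that redirects to the target and is followed immediately adds another full share to each outlink, so the outlinks end up with more score than the target ever held.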


Sami Siren added a comment - 07/Sep/06 06:13 PM
+1 for not following redirects immediately - simplify fetcher logic.

I would also like to see a flexible (configurable?) solution, not a one-size-fits-all one, because there are conflicting requirements (or at least opinions) around this topic.


Chris Schneider made changes - 25/Sep/06 04:21 PM
Link This issue is related to NUTCH-371 [ NUTCH-371 ]
Johannes Zillmann added a comment - 19/Nov/06 04:13 PM
As a consequence of this issue, a crawl could be permanently blocked.
Imagine the top 5 million entries of the crawldb are all redirect URLs whose targets have already been fetched.
Then you can successfully generate 5 million, fetch 5 million and parse 5 million, but after an update of the crawldb, nothing has happened!
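The scenario above can be reproduced on a toy model. The sketch below is only an illustration under stated assumptions: `BlockedCrawlSketch`, `Url`, `generate` and `cycle` are hypothetical names, the generator is assumed to pick the highest-scoring entries that the db still considers unfetched, and the `updateRedirectSources` flag switches between the reported behavior and a corrected one.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class BlockedCrawlSketch {
    /** Toy crawl db record; not Nutch's CrawlDatum. */
    static final class Url {
        final String name;
        final double score;
        final boolean redirects; // true if fetching this URL yields a redirect
        boolean fetched;         // status as seen by the generator

        Url(String name, double score, boolean redirects) {
            this.name = name;
            this.score = score;
            this.redirects = redirects;
        }
    }

    /** Selects the topN highest-scoring URLs that the db still considers unfetched. */
    static List<Url> generate(List<Url> db, int topN) {
        return db.stream()
                .filter(u -> !u.fetched)
                .sorted(Comparator.comparingDouble((Url u) -> u.score).reversed())
                .limit(topN)
                .collect(Collectors.toList());
    }

    /** Runs generate, fetch and update once; returns the names that were selected. */
    static List<String> cycle(List<Url> db, int topN, boolean updateRedirectSources) {
        List<String> selected = new ArrayList<>();
        for (Url u : generate(db, topN)) {
            selected.add(u.name);
            // The reported bug: a redirecting URL's own status is never written back.
            if (!u.redirects || updateRedirectSources) {
                u.fetched = true;
            }
        }
        return selected;
    }

    /** Two high-scoring redirecting URLs ahead of one ordinary page. */
    static List<Url> sampleDb() {
        List<Url> db = new ArrayList<>();
        db.add(new Url("redirect-a", 3.0, true));
        db.add(new Url("redirect-b", 2.0, true));
        db.add(new Url("page-c", 1.0, false));
        return db;
    }

    public static void main(String[] args) {
        List<Url> buggy = sampleDb();
        System.out.println(cycle(buggy, 2, false));
        System.out.println(cycle(buggy, 2, false)); // same selection again

        List<Url> fixed = sampleDb();
        System.out.println(cycle(fixed, 2, true));
        System.out.println(cycle(fixed, 2, true)); // moves on to page-c
    }
}
```

With the flag off, every cycle selects the same two redirecting URLs and `page-c` is never reached; with the flag on, the second cycle moves on.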

Stefan Groschupf added a comment - 25/Nov/06 10:39 AM
I agree this is a serious problem for any production use of Nutch - a blocker, since you end up refetching the same pages again and again.

Stefan Groschupf made changes - 25/Nov/06 10:39 AM
Priority Major [ 3 ] Blocker [ 1 ]
Eelco Lempsink added a comment - 22/Dec/06 09:38 AM
Let's not overcomplicate this issue. At the moment, two different problems of different priorities are mixed in one issue.

Problem 1, blocker: The status of the URL causing the redirect isn't updated. Fixing that is not hard; a one-liner patch is attached. Hopefully this can be applied soon.

Problem 2, minor: Should redirects be fetched immediately or not? One argument for fetching them immediately is that otherwise the redirectCount would have to be moved into the CrawlDatum (metadata). If it's possible (in Jira), I suggest splitting this problem into a separate issue.
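To make the redirectCount concern in Problem 2 concrete: if the target is fetched in a later segment, the number of hops so far has to travel with the target's record. A minimal sketch of that idea follows, using a plain string map as a stand-in for CrawlDatum metadata. The key name `redirectCount`, the class `RedirectCountSketch` and the method `metadataForTarget` are assumptions made for illustration, not part of the attached patch or of Nutch.

```java
import java.util.HashMap;
import java.util.Map;

public class RedirectCountSketch {
    // Hypothetical metadata key carrying the number of redirects followed so far.
    static final String REDIRECT_COUNT_KEY = "redirectCount";

    /**
     * Builds the metadata to store with a redirect target, or returns null
     * when the chain would exceed maxRedirects and the target should be dropped.
     */
    static Map<String, String> metadataForTarget(Map<String, String> sourceMeta, int maxRedirects) {
        int count = Integer.parseInt(sourceMeta.getOrDefault(REDIRECT_COUNT_KEY, "0")) + 1;
        if (count > maxRedirects) {
            return null;
        }
        Map<String, String> targetMeta = new HashMap<>(sourceMeta);
        targetMeta.put(REDIRECT_COUNT_KEY, Integer.toString(count));
        return targetMeta;
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();
        // Follow a chain of redirects across segments until the limit is hit.
        while (meta != null) {
            System.out.println(meta);
            meta = metadataForTarget(meta, 2);
        }
    }
}
```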


Eelco Lempsink made changes - 22/Dec/06 09:38 AM
Attachment Fetcher.java-489586.diff [ 12347719 ]
Repository: ASF
Revision: #490607
Date: Thu Dec 28 00:03:04 UTC 2006
User: ab
Message:
This patch addresses several issues:

* NUTCH-415 - Generator should mark selected records in CrawlDb.
  Due to increased resource consumption this step is optional.
  Application-level locking has been added to prevent concurrent
  modification of databases.

* NUTCH-416 - CrawlDatum status and CrawlDbReducer refactoring. It is
  now possible to correctly update CrawlDb from multiple segments.
  Introduce new status codes for temporary and permanent
  redirection.

* NUTCH-322 - Fix Fetcher to store redirected pages and to store
  protocol-level status. This also should fix NUTCH-273.

* Change default Fetcher behavior not to follow redirects immediately.
  Instead Fetcher will record redirects as new pages to be added to CrawlDb.
  This also partially addresses NUTCH-273.

* Detect and report when Generator creates 0-sized segments.

* Fix Injector to preserve already existing CrawlDatum if the seed list
  being injected also contains such URL.

This development was partially supported by SiteSell Inc.
Files Changed
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbMerger.java
MODIFY /lucene/nutch/trunk/src/test/org/apache/nutch/crawl/TestCrawlDbMerger.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/tools/compat/CrawlDbConverter.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/MapWritable.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/Indexer.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/LinkDb.java
MODIFY /lucene/nutch/trunk/src/test/org/apache/nutch/crawl/TestInjector.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Injector.java
MODIFY /lucene/nutch/trunk/src/test/org/apache/nutch/crawl/TestGenerator.java
MODIFY /lucene/nutch/trunk/CHANGES.txt
MODIFY /lucene/nutch/trunk/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/metadata/Nutch.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReader.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDatum.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDb.java
MODIFY /lucene/nutch/trunk/src/test/org/apache/nutch/fetcher/TestFetcher.java
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java
MODIFY /lucene/nutch/trunk/conf/nutch-default.xml
MODIFY /lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java
ADD /lucene/nutch/trunk/src/java/org/apache/nutch/util/LockUtil.java

Andrzej Bialecki added a comment - 28/Dec/06 12:18 AM
Fixed in trunk/, rev. 490607 .

Andrzej Bialecki made changes - 28/Dec/06 12:18 AM
Status Open [ 1 ] Closed [ 6 ]
Resolution Fixed [ 1 ]
Fix Version/s 0.9.0 [ 12312013 ]
Assignee Andrzej Bialecki [ ab ]