[NUTCH-572] Scoring and redirected Urls - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Invalid
Affects Version/s: 0.8, 0.8.1, 0.9.0
Fix Version/s: 1.0.0
Component/s: fetcher
Labels:
None
Environment:

All

Description

When a redirect is found for a given url, the new (end) url is stored as the content page and the old CrawlDatum gets one of a few redirect status codes. The page that gets indexed in Nutch is the end page, and it is indexed under the end url. Often a site will have a significant number of links pointing to the start page and very few pointing to the redirected end page. This is especially true for external links. OPIC scores do not get transferred to the end page but stay with the start page (the one doing the redirecting), and the start page doesn't get indexed. Hence the end page shows up in the index, but usually with a much lower score. A good example of this is cnn.com:

URL: http://www.cnn.com/
Version: 6
Status: 5 (db_redir_perm)
Fetch time: Tue Dec 04 11:02:09 CST 2007
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 51.19438
Signature: b5baaf80e9e10aa6205fc39051c362ff
Metadata: pst:success(1), lastModified=0

which redirects to http://www.cnn.com/?refresh=1

URL: http://www.cnn.com/?refresh=1
Version: 6
Status: 2 (db_fetched)
Fetch time: Tue Dec 04 11:02:11 CST 2007
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: b5baaf80e9e10aa6205fc39051c362ff
Metadata: pst:success(1), lastModified=0

Now cnn, which should be one of the highest ranking sites in the index, if not the highest, for keywords such as "news", in fact doesn't show up in the index under its start url, and its redirected end page appears much farther down in search results. My proposal is that we somehow make OPIC scores follow redirects. To do this we would most likely need to store both a start and an end url for redirected urls.
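To make the proposal concrete, here is a minimal sketch of what "scores follow redirects" could look like. It is not Nutch code and does not use the Nutch API: `Entry`, `resolve`, `propagate_scores`, and the `redirects` mapping (start url to end url) are hypothetical names introduced for illustration. It also assumes the start page's score is added to the end page's score and then zeroed on the start page, which is only one possible policy.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical stand-in for a crawl db record (not Nutch's CrawlDatum)."""
    url: str
    status: str
    score: float


def resolve(url, redirects, max_hops=5):
    """Follow the start -> end mapping until a non-redirecting url is reached.

    Stops after max_hops or when a url repeats, so redirect loops terminate.
    """
    seen = {url}
    for _ in range(max_hops):
        nxt = redirects.get(url)
        if nxt is None or nxt in seen:
            break
        seen.add(nxt)
        url = nxt
    return url


def propagate_scores(entries, redirects):
    """Return a copy of entries in which each redirecting url has handed its
    score to the url it finally resolves to (assumed policy: add, then zero)."""
    result = {u: Entry(e.url, e.status, e.score) for u, e in entries.items()}
    for start in redirects:
        end = resolve(start, redirects)
        if start == end or start not in result or end not in result:
            continue
        result[end].score += result[start].score
        result[start].score = 0.0
    return result
```

Applied to the two records above, with `redirects = {"http://www.cnn.com/": "http://www.cnn.com/?refresh=1"}`, the end url would carry the combined score of both entries, so the indexed page would rank according to the links pointing at the start url as well.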

Issue Links

is related to

NUTCH-411 Parse ignores meta refresh redirection

Closed

NUTCH-547 Redirection handling: YahooSlurp's algorithm

Closed

People

Assignee: Dennis Kubes

Reporter: Dennis Kubes

Votes: 0

Watchers: 0

Dates

Created: 04/Nov/07 21:41

Updated: 10/Apr/09 12:29

Resolved: 20/Jan/09 15:58