Description
After reading Yahoo's algorithm (the one Andrzej linked to:
http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html )
in the redirect/alias handling discussion, I had a bit of spare
time, so I implemented it.
Note that the patch I am attaching is for the 'choosing' algorithm described in
Yahoo's help page. It makes no attempt to handle aliases in any way. (See http://www.nabble.com/Redirects-and-alias-handling-%28LONG%29-tf4270371.html#a12154362 for the discussion about alias handling).
E.g.,
generate "http://www.milliyet.com.tr/"
fetch "http://www.milliyet.com.tr/" which redirects to
"http://www.milliyet.com.tr/2007/08/29/index.html?ver=39".
Update the metadata of the second page's datum to indicate that
"http://www.milliyet.com.tr/" is the representative form.
Updatedb, invertlinks, etc...
While indexing second page, change its "url" field to
"http://www.milliyet.com.tr/".
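A minimal sketch of the choice made in the example above. This is not the attached patch and not the full rule set from Yahoo's help page; it only assumes one simplified rule: when both URLs are on the same host and the redirect source is the site root, keep the source, otherwise keep the redirect target. The names ReprUrlSketch, isRoot and chooseRepr are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class ReprUrlSketch {

    // A URL counts as "root" here when its path is empty or "/" and it has no query string.
    static boolean isRoot(URI uri) {
        String path = uri.getPath();
        boolean rootPath = path == null || path.isEmpty() || path.equals("/");
        return rootPath && uri.getQuery() == null;
    }

    // Assumed, simplified rule: same host and root source wins; in every other case the target wins.
    static String chooseRepr(String src, String dst) {
        URI srcUri;
        URI dstUri;
        try {
            srcUri = new URI(src);
            dstUri = new URI(dst);
        } catch (URISyntaxException e) {
            // If either URL cannot be parsed, fall back to the redirect target.
            return dst;
        }
        String srcHost = srcUri.getHost();
        String dstHost = dstUri.getHost();
        if (srcHost == null || dstHost == null) {
            return dst;
        }
        boolean sameHost = srcHost.equalsIgnoreCase(dstHost);
        if (sameHost && isRoot(srcUri) && !isRoot(dstUri)) {
            return src;
        }
        return dst;
    }

    public static void main(String[] args) {
        String src = "http://www.milliyet.com.tr/";
        String dst = "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39";
        // Prints http://www.milliyet.com.tr/
        System.out.println(chooseRepr(src, dst));
    }
}
```

In the real flow the chosen value would be stored in the datum's metadata at fetch time and read back at indexing time to overwrite the "url" field; the sketch covers only the choice itself.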
Attachments
Issue Links
- relates to NUTCH-572 Scoring and redirected Urls (Closed)