Issue Details (XML | Word | Printable)

Key: NUTCH-547
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Doğacan Güney
Reporter: Doğacan Güney
Votes: 0
Watchers: 1
Operations

If you were logged in you would be able to see more operations.
Nutch

Redirection handling: YahooSlurp's algorithm

Created: 03/Sep/07 07:46 AM   Updated: 10/Apr/09 12:29 PM
Return to search
Component/s: fetcher
Affects Version/s: None
Fix Version/s: 1.0.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works NUTCH-547-3.patch 2007-11-07 07:23 PM Dennis Kubes 26 kB
Text File Licensed for inclusion in ASF works redirect_draft.patch 2007-09-03 07:48 AM Doğacan Güney 25 kB
Text File Licensed for inclusion in ASF works redirect_draft_v2.patch 2007-09-20 02:05 PM Doğacan Güney 26 kB
Issue Links:
Reference
 

Resolution Date: 08/Nov/07 01:18 PM


 Description  « Hide
After reading Yahoo's algorithm (then one Andrzej linked to:
http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html )
in the redirect/alias handling discussion, I had a bit of a spare
time, so I implemented it.

Note that the patch I am attaching is for the 'choosing' algorithm described in
Yahoo's help page. It makes no attempt to handle aliases in any way. (See http://www.nabble.com/Redirects-and-alias-handling-%28LONG%29-tf4270371.html#a12154362 for the discussion about alias handling).

E.g,
generate "http://www.milliyet.com.tr/"

fetch "http:/www.milliyet.com.tr/" which redirects to
"http://www.milliyet.com.tr/2007/08/29/index.html?ver=39".

Update second page's datum's metadata to indicate that
"http://www.milliyet.com.tr/" is the representative form.

Updatedb, invertlinks, etc...

While indexing second page, change its "url" field to
"http://www.milliyet.com.tr/".



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Doğacan Güney added a comment - 03/Sep/07 07:48 AM
Draft version.

Andrzej Bialecki added a comment - 03/Sep/07 06:12 PM
A few comments:
  • the patch uses a strange diff format ... the first lines of context diffs appear on the same lines as chunk coordinates.
  • in Fetcher[2].handleRedirect(), what happens when the selected reprUrl is the same as the urlString? We should skip the redirect then.
  • the repeating parsing of refreshTime should be hidden in a utility method in ParseStatus - although the proper way to support this would be to extend ParseStatus to store this int value if necessary, i.e. if ParseStatus is SUCCESS_REDIRECT (we would have to bump the version number, too).
  • minimum refreshTime should be at least a constant, or configurable, and not a literal. Similarly the redirType should be a constant.
  • parsing of the redirect time should be moved IMHO to handleRedirect(), to simplify the logic in the FetcherThread.run().
  • if we change the "url" field in BasicIndexingFilter, shouldn't we also change the "site"and "host" fields? We could also consider adding reprUrl as an additional value for the same "url" field - this way we would get hits both on the original url and the reprUrl.
  • I'm not sure why the patch to Indexer.java tries to overwrite reprUrl from fetchDatum with the value from dbDatum - if anything, the value in fetchDatum should be more up to date, no? as it is now, it's silently overwritten. The only way the reprUrl could end up in dbDatum is from a previous updatedb operation, so it should contain an older information.

Doğacan Güney added a comment - 04/Sep/07 11:41 AM
Thanks a lot for the quick review, Andrzej.

> * the patch uses a strange diff format ... the first lines of context diffs appear on the same lines as chunk coordinates.

Sorry about that. I am using git-svn (which, by the way, is an awesome tool) to develop nutch so I may have forgotten to use "svn diff" for the patch.

> * in Fetcher[2].handleRedirect(), what happens when the selected reprUrl is the same as the urlString? We should skip the
> redirect then.

We don't follow reprUrl, we follow newUrl which is tested for equality with urlString. However, we should probably avoid writing reprUrl in crawldatum metadata if it is the same as the urlString.

> * the repeating parsing of refreshTime should be hidden in a utility method in ParseStatus - although the proper way to
> support this would be to extend ParseStatus to store this int value if necessary, i.e. if ParseStatus is SUCCESS_REDIRECT (we
> would have to bump the version number, too).

Good point. Will look into that.

> * minimum refreshTime should be at least a constant, or configurable, and not a literal. Similarly the redirType should be a
> constant.

This patch is only a rough draft. I will fix all such issues in a later patch.

> * if we change the "url" field in BasicIndexingFilter, shouldn't we also change the "site"and "host" fields? [...]

Wow, can't believe I missed that.

> [..] We could also consider adding reprUrl as an additional value for the same "url" field - this way we would get hits both on
> the original url and the reprUrl.

This may cause problems with dedup which assumes that "url" field has a single value. Also, it may be difficult to decide which value of "url" to show in web UI. I also like that fact that "url" is like a UNIQUE KEY for the document. If we allow "url" to have multiple values we lose that.

Perhaps we can add reprUrl to a "repr" field instead?


Andrzej Bialecki added a comment - 10/Sep/07 08:24 PM
> > I'm not sure why the patch to Indexer.java tries to overwrite reprUrl from fetchDatum with the value from dbDatum [..]

I'm still not sure about this issue - could you please clarify?

> Perhaps we can add reprUrl to a "repr" field instead?

Shouldn't this be the other way around - the idea of your patch is to put the data under the reprUrl, so in order to minimize code changes you replace the original url with reprUrl. This way we lose the value of the original url, so it seems to me that if we want to preserve it we should add it to an "orig" field ..


Doğacan Güney added a comment - 10/Sep/07 08:43 PM
>> > I'm not sure why the patch to Indexer.java tries to overwrite reprUrl from fetchDatum with the value from dbDatum [..]
>
> I'm still not sure about this issue - could you please clarify?

Sorry, it seems I forgot to answer it

It is possible that we discover a meta-redirect during parse phase. We have no way of updating fetch datum-s at this point, so instead, parse writes this information to crawl_parse which of course is then passed to crawldb. So, during indexing, it is possible that dbDatum contains (meta-)redirect information while fetchDatum does not. But you are right that we should probably give some sort of priority to fetchDatum's metadata over dbDatum.

> Perhaps we can add reprUrl to a "repr" field instead?
>
> Shouldn't this be the other way around - the idea of your patch is to put the data under the reprUrl, so in order to minimize code changes you
> replace the original url with reprUrl. This way we lose the value of the original url, so it seems to me that if we want to preserve it we should add it
> to an "orig" field ..

OK, this makes sense to me. I guess we should make "orig" both indexed and stored?

Btw, one of the major issues with redirection (that this patch does not solve) is that scores/other information are not reflected in redirections. Assume foo.com is a major web site. Url http://www.foo.com/ redirects to http://www.foo.com/daily.html . People, naturally, are much more likely to link to http://www.foo.com/ then http://www.foo.com/daily.html (the problem is even more interesting if http://foo.com also points to http://www.foo.com/daily.html ). So, I think we must have some a way to "pass" the score from source site to redirection site. Same thing for adaptive crawls: It may look like www.foo.com never changes (since it just redirects to a different url). But it should be considered "modified" whenever page at redirect url is updated.

I am not sure how we can achieve this, though. We will probably need an extra job (that should run at least once before indexing) that merges information from such pages.


Doğacan Güney added a comment - 20/Sep/07 02:05 PM
Draft 2
  • Refresh time is now stored as an extra argument.
  • During indexing, if "url" field is changed, "host" and "site" fields are updated accordingly.
  • Made refreshTime magic number a constant. Made redirType a constant.

Dennis Kubes added a comment - 07/Nov/07 07:22 PM
Doğacan, great work on this patch. I have brought your patch up to the most recent trunk and tested it. Works good. In terms of scoring your patch actually does handle the most common scoring error. That where we wouldn't store or index the source url for temporary refreshes (See NUTCH-572). All unit tests have passed. I am +1 for committing this ASAP.

Dennis Kubes added a comment - 07/Nov/07 07:23 PM
Updated to current trunk.

Doğacan Güney added a comment - 07/Nov/07 08:05 PM
Dennis, thanks for the update and testing. I have been a bit unresponsive lately, I hope that this will change (very) soon.

I actually was going to commit it a while back, but Andrzej said that he is working on adding alias support/etc. (as outlined in his earlier email to nutch-dev) which included this patch as part of his effort but it probably fell off his radar during his move and everything.

Anyway, I feel that lately we have been too reluctant to commit new stuff. So I am also +1 for committing it. If we break something or miss something, we can always fix it later

So I am going to give this one a day or two (say, until sunday) and commit it if there are no objections. Is this OK with you, Dennis?


Dennis Kubes added a comment - 07/Nov/07 08:30 PM
I agree that we should be less reluctant to commit things and hopefully that will pick up the speed of Nutch development. This patch has been out there for a while, all I did was to bring it up to the current trunk. You may want to run it through a final test but I feel that it is ready and if there are larger problems that aren't apparent we can always roll it back. +1 for committing it ASAP. I wouldn't wait until Sunday. Whenever you feel it is ready.

Chris A. Mattmann added a comment - 07/Nov/07 08:35 PM
+1: without having tested it, it improves an existing significant functionality and since we're using a CM, we can always revert, so IMO, go ahead and commit since it's been out there for quite a while.

Doğacan Güney added a comment - 08/Nov/07 01:18 PM
Latest patch committed in rev. 593151 with a trivial modification.

There was a minor typo in Indexer.java (word "protocol" was added to a comment).


Hudson added a comment - 09/Nov/07 05:36 AM