|
[
Permlink
| « Hide
]
Doğacan Güney added a comment - 03/Sep/07 07:48 AM
Draft version.
A few comments:
Thanks a lot for the quick review, Andrzej.
> * the patch uses a strange diff format ... the first lines of context diffs appear on the same lines as chunk coordinates. Sorry about that. I am using git-svn (which, by the way, is an awesome tool) to develop nutch so I may have forgotten to use "svn diff" for the patch. > * in Fetcher[2].handleRedirect(), what happens when the selected reprUrl is the same as the urlString? We should skip the We don't follow reprUrl, we follow newUrl which is tested for equality with urlString. However, we should probably avoid writing reprUrl in crawldatum metadata if it is the same as the urlString. > * the repeating parsing of refreshTime should be hidden in a utility method in ParseStatus - although the proper way to Good point. Will look into that. > * minimum refreshTime should be at least a constant, or configurable, and not a literal. Similarly the redirType should be a This patch is only a rough draft. I will fix all such issues in a later patch. > * if we change the "url" field in BasicIndexingFilter, shouldn't we also change the "site"and "host" fields? [...] Wow, can't believe I missed that. > [..] We could also consider adding reprUrl as an additional value for the same "url" field - this way we would get hits both on This may cause problems with dedup which assumes that "url" field has a single value. Also, it may be difficult to decide which value of "url" to show in web UI. I also like that fact that "url" is like a UNIQUE KEY for the document. If we allow "url" to have multiple values we lose that. Perhaps we can add reprUrl to a "repr" field instead? > > I'm not sure why the patch to Indexer.java tries to overwrite reprUrl from fetchDatum with the value from dbDatum [..]
I'm still not sure about this issue - could you please clarify? > Perhaps we can add reprUrl to a "repr" field instead? Shouldn't this be the other way around - the idea of your patch is to put the data under the reprUrl, so in order to minimize code changes you replace the original url with reprUrl. This way we lose the value of the original url, so it seems to me that if we want to preserve it we should add it to an "orig" field .. >> > I'm not sure why the patch to Indexer.java tries to overwrite reprUrl from fetchDatum with the value from dbDatum [..]
> > I'm still not sure about this issue - could you please clarify? Sorry, it seems I forgot to answer it It is possible that we discover a meta-redirect during parse phase. We have no way of updating fetch datum-s at this point, so instead, parse writes this information to crawl_parse which of course is then passed to crawldb. So, during indexing, it is possible that dbDatum contains (meta-)redirect information while fetchDatum does not. But you are right that we should probably give some sort of priority to fetchDatum's metadata over dbDatum. > Perhaps we can add reprUrl to a "repr" field instead? OK, this makes sense to me. I guess we should make "orig" both indexed and stored? — Btw, one of the major issues with redirection (that this patch does not solve) is that scores/other information are not reflected in redirections. Assume foo.com is a major web site. Url http://www.foo.com/ I am not sure how we can achieve this, though. We will probably need an extra job (that should run at least once before indexing) that merges information from such pages. Draft 2
Doğacan, great work on this patch. I have brought your patch up to the most recent trunk and tested it. Works good. In terms of scoring your patch actually does handle the most common scoring error. That where we wouldn't store or index the source url for temporary refreshes (See
Dennis, thanks for the update and testing. I have been a bit unresponsive lately, I hope that this will change (very) soon.
I actually was going to commit it a while back, but Andrzej said that he is working on adding alias support/etc. (as outlined in his earlier email to nutch-dev) which included this patch as part of his effort but it probably fell off his radar during his move and everything. Anyway, I feel that lately we have been too reluctant to commit new stuff. So I am also +1 for committing it. If we break something or miss something, we can always fix it later So I am going to give this one a day or two (say, until sunday) and commit it if there are no objections. Is this OK with you, Dennis? I agree that we should be less reluctant to commit things and hopefully that will pick up the speed of Nutch development. This patch has been out there for a while, all I did was to bring it up to the current trunk. You may want to run it through a final test but I feel that it is ready and if there are larger problems that aren't apparent we can always roll it back. +1 for committing it ASAP. I wouldn't wait until Sunday. Whenever you feel it is ready.
+1: without having tested it, it improves an existing significant functionality and since we're using a CM, we can always revert, so IMO, go ahead and commit since it's been out there for quite a while.
Latest patch committed in rev. 593151 with a trivial modification.
There was a minor typo in Indexer.java (word "protocol" was added to a comment). Integrated in Nutch-Nightly #261 (See http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/261/
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||