
Key: NUTCH-353
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Andrzej Bialecki
Reporter: Stefan Groschupf
Votes: 3
Watchers: 3
Nutch

pages that serverside forwards will be refetched every time

Created: 18/Aug/06 04:50 AM   Updated: 10/Apr/09 12:29 PM
Component/s: None
Affects Version/s: 0.8.1, 0.9.0
Fix Version/s: 1.0.0

Time Tracking:
Not Specified

File Attachments:
doNotRefecthForwarderPagesV1.patch (text file, 0.7 kB, licensed for inclusion in ASF works), attached 2006-08-18 04:50 AM by Stefan Groschupf

Resolution Date: 03/Feb/09 01:19 PM


 Description
Pages that do a server-side forward are not written back into the crawlDb with a status change, and their nextFetchTime is not updated.
This causes the same page to be refetched again and again. As a result, Nutch is not polite: it refetches both the forwarding page and the target page in each segment iteration. It also affects the scoring, since the forwarding page contributes its score to all outlinks.

 Comments
Stefan Groschupf added a comment - 18/Aug/06 04:50 AM
Since we discussed that Nutch needs to be more polite, we should fix this ASAP.

King Kong added a comment - 15/Sep/06 04:17 AM
This is a really serious problem, because the original URLs are fetched again and again.

I agree with Stefan's solution.

I think this problem should attract more people's attention.


Andrzej Bialecki added a comment - 23/Sep/06 05:41 PM
I think this issue requires more discussion, especially how it affects the linkdb.

Let's say that page A links to B, but B redirects to C. Issues to discuss:

  • should we mark B as gone? we could do so, to prevent refetching. We should also store the redirect url in CrawlDatum.metaData. This redirect url may change in the future to some other value, but since no page is ever truly gone (we should retry it at some point in the future) we should be able to adjust the redirect info.
  • for all practical purposes, C now becomes a replacement for B. Should we transfer all inlink information (anchor text, incoming urls, and score contributions) to C? From the implementation point of view this would require changes to linkdb format, to be able to create "aliases" that automatically transfer all inlink information to C even though it's inserted under B.
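
The first point can be sketched roughly as follows. This is only an illustration under stated assumptions: `PageRecord`, the `Status` values, and the `redirect.to` key are hypothetical stand-ins and not Nutch's actual CrawlDatum API. The idea is that the redirecting page B gets a status change, a new next-fetch time, and the current redirect target, which a later retry may overwrite.

```java
import java.util.HashMap;
import java.util.Map;

class RedirectRecordSketch {
    // Hypothetical status values; not the constants Nutch actually uses.
    enum Status { UNFETCHED, FETCHED, REDIRECTED }

    // Hypothetical stand-in for one crawl database entry.
    static class PageRecord {
        Status status = Status.UNFETCHED;
        long nextFetchTime = 0L;
        final Map<String, String> metaData = new HashMap<>();
    }

    // Assumed metadata key name, chosen for this sketch only.
    static final String REDIRECT_KEY = "redirect.to";

    // Writes the redirect back into the record: the status changes, the next
    // fetch is pushed into the future, and the target URL is stored (replacing
    // any target recorded by an earlier fetch).
    static void recordRedirect(PageRecord page, String targetUrl,
                               long now, long retryIntervalMs) {
        page.status = Status.REDIRECTED;
        page.nextFetchTime = now + retryIntervalMs;
        page.metaData.put(REDIRECT_KEY, targetUrl);
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000;
        PageRecord b = new PageRecord();
        // First fetch: B redirects to C.
        recordRedirect(b, "http://example.com/C", 0L, 30 * day);
        // Retry 30 days later: B now redirects elsewhere, so the info is adjusted.
        recordRedirect(b, "http://example.com/D", 30 * day, 30 * day);
        System.out.println(b.status + " -> " + b.metaData.get(REDIRECT_KEY)
                + ", next fetch at " + b.nextFetchTime);
    }
}
```

Because the page is rescheduled rather than dropped, it is still retried eventually, which matches the remark that no page is ever truly gone.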

Doug Cook added a comment - 02/Oct/06 06:10 PM
This is definitely a complex issue. It is also high priority – issues with redirects and duplicates, which URL is chosen, and what happens to the anchor text for the pages involved are causing significant relevance issues.
A few observations:

(1) A redirect target is not always the canonical version of a URL. For example, it is very common for root-level pages to redirect to an internal home page (some 30% of the root pages in my index do so). However, the root pages have all the anchor text and are truly the canonical, permanent version of the page; the internal redirect target is just the "temporary" homepage, and could change at any time depending on the site implementation. Here are some examples:
http://www.landwirtschaft-bw.info/
http://www.dlr-rnh.rlp.de/
http://www.niederoesterreich.at/
Because of the current policy of "discarding" the redirect source, I lose 30% of the home pages in my index, which makes my relevance very poor for navigational queries.

In this case, we would likely want to mark the internal redirect target as an alias as Andrzej suggests, and automatically transfer any link information to the root page.

(2) There may be other cases where we want to alias two pages, either to avoid recrawling them, or to merge anchor text. Suppose we crawl both
http://www.x.com/
and
http://www.x.com/index.html
and these are the same document.

Right now we will always crawl both of these, and the dedup algorithm will pick one (sadly often the /index.html version due to strange score anomalies), and throw out the anchor text for the other. While we can't safely normalize these two URLs to be the same in advance of seeing the content, once we see that the signatures are the same, we can, and should, merge them so that the index.html version is marked as an alias of the / version, and future crawls simply skip crawling the /index.html version and transfer its link information to the / page.
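
A rough sketch of the merge described above, under stated assumptions: the `Page` class, the `aliasOf` field, and the "shorter URL wins" rule are all hypothetical and chosen only for illustration; they are not how Nutch's dedup or linkdb currently work. Pages with the same content signature are grouped, one representative is picked, and the others are marked as aliases whose anchor text moves to the representative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AliasMergeSketch {
    // Hypothetical fetched-page record: URL, content signature, inlink anchors.
    static class Page {
        final String url;
        final String signature;
        final List<String> anchors = new ArrayList<>();
        String aliasOf = null; // set when this page is an alias of another URL

        Page(String url, String signature, String... anchorTexts) {
            this.url = url;
            this.signature = signature;
            this.anchors.addAll(Arrays.asList(anchorTexts));
        }
    }

    // Assumed preference rule for this sketch: the shorter URL wins.
    static boolean preferred(String candidate, String current) {
        return candidate.length() < current.length();
    }

    // Returns one representative page per signature; every other page with the
    // same signature is marked as an alias and its anchors are transferred.
    static Map<String, Page> merge(List<Page> pages) {
        Map<String, Page> reps = new LinkedHashMap<>();
        // Pass 1: pick the representative for each signature.
        for (Page p : pages) {
            Page rep = reps.get(p.signature);
            if (rep == null || preferred(p.url, rep.url)) {
                reps.put(p.signature, p);
            }
        }
        // Pass 2: alias the rest and move their link information.
        for (Page p : pages) {
            Page rep = reps.get(p.signature);
            if (p != rep) {
                p.aliasOf = rep.url;
                rep.anchors.addAll(p.anchors);
                p.anchors.clear();
            }
        }
        return reps;
    }

    public static void main(String[] args) {
        Page index = new Page("http://www.x.com/index.html", "abc", "X home");
        Page root = new Page("http://www.x.com/", "abc", "X Corp");
        Map<String, Page> reps = merge(Arrays.asList(index, root));
        Page rep = reps.get("abc");
        System.out.println(rep.url + " keeps anchors " + rep.anchors);
        System.out.println(index.url + " is an alias of " + index.aliasOf);
    }
}
```

A future crawl could then skip any page whose `aliasOf` is set. The preference rule is the part that needs real thought; URL length is only a placeholder here.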

This problem, like the first one, is causing me to lose root-level URLs along with their anchor text, further affecting relevance for navigational queries.

In short, I agree with Andrzej that we need a way to mark a URL as an alias of another, to avoid recrawl, and to merge link information. We need to be careful, however, of which URL we pick. It is not always the redirect target that should win. And some of our current concept of "duplicates" should also be subsumed under the new notion of "alias."

I'm happy to help out in any way with a fix. I'm just looking at hacking together something in my own environment because the problems are affecting me so severely, but as I'm new-ish to Nutch, what I come up with might not be as elegant or flexible as what others might envision...


Ken Krugler added a comment - 02/Oct/06 08:26 PM
+1 that the redirect target is not always the "real" URL that we want to keep.

For example, http://www.ibm.com/developerworks/lotus/downloads/toolkits.html => http://www-128.ibm.com/developerworks/lotus/downloads/toolkits.html. This holds true for most (all?) developerWorks pages; they redirect to www-128.ibm.com/<whatever>, but IBM would love for the URL everybody sees to still be www.ibm.com/<whatever>.


Doug Cutting added a comment - 03/Oct/06 10:43 PM
It's worth noting that Google, Yahoo! and Microsoft's searches all return lots of links to www-XXX.ibm.com. Just some evidence that this may not be an easy problem to solve.

Uros Gruber added a comment - 05/Oct/06 07:17 PM
I don't think there is a 100% solution, mostly because not everyone respects the standards. For example, www.ibm.com uses the 302 status code, which the RFC defines as follows: "The requested resource resides temporarily under a different URI. Since the redirection might be altered on occasion, the client SHOULD continue to use the Request-URI for future requests. This response is only cacheable if indicated by a Cache-Control or Expires header field." This case is clear: we should use the original URL.

But then there is also the permanent redirect, which SHOULD replace the old URL and also update all links pointing to the old URL with the new one.
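
The distinction between the two status codes can be sketched as a small decision function. This is a minimal illustration, not Nutch code: `representativeUrl` is a hypothetical helper, and it assumes the server uses the status codes as the RFC intends, which is not always the case in practice.

```java
class RedirectPolicySketch {
    // Returns the URL that should be kept as the representative one,
    // based only on the HTTP status code of the redirect.
    static String representativeUrl(int statusCode, String requestedUrl, String targetUrl) {
        switch (statusCode) {
            case 301: // Moved Permanently: the new URL replaces the old one
                return targetUrl;
            case 302: // Found (temporary): keep using the Request-URI
            case 307: // Temporary Redirect: same treatment
                return requestedUrl;
            default:  // anything else: conservatively keep the requested URL
                return requestedUrl;
        }
    }

    public static void main(String[] args) {
        String source = "http://www.example.com/page";
        String target = "http://www2.example.com/page";
        System.out.println("301 keeps " + representativeUrl(301, source, target));
        System.out.println("302 keeps " + representativeUrl(302, source, target));
    }
}
```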

I have also seen some examples of wrong redirections. One of them was my fault too. I used an Alias definition with the Apache server to accept connections without the www subdomain, and within the page I left a link to the main page pointing to index.php instead of just /. After a while my domain.si/index.php became more important than www.domain.si (both point to the same site).

So as I see it, this job is not simple at all. Maybe we need a schema or some sort of flow diagram to indicate what to do in a given situation.

I hope my notes help a bit, because at the moment we really have a lot of unwanted URLs in our index.


Ken Krugler added a comment - 20/Jan/07 06:27 PM
Another small note about this (see NUTCH-411 for a related but different problem) ...

If a page (e.g. http://boutell.com) returns a meta refresh header (e.g. <meta http-equiv="refresh" content="0;url=http://www.boutell.com/">), and you also wind up fetching the target page independently, then it looks like you can wind up with both pages in the crawl results. One entry has a title like "boutell.com", while the other has the real page title. Or at least I've seen this a few times in our crawl results.


Ken Krugler added a comment - 20/Jan/07 06:29 PM
Wait, looks like maybe change 490607 (fix for NUTCH-273) might fix the issue I just described in my previous comment. I don't think our latest public crawl was done with this patch.

Andrzej Bialecki added a comment - 20/Jan/07 10:23 PM
I believe the patch in NUTCH-273 fixes a large part of the problem that you describe: we record the fact that there was a redirect, and Indexer indexes only the final page.

The other parts, though (correct treatment of inlink information, and selection of representative pages for chains of redirects), are not addressed yet.


Doug Cook added a comment - 20/Jan/07 11:36 PM
I have a local fix for this problem (partly Paul Gauthier's work, partly mine) that I have been testing for some time. It's a little bit of a hack, but it's much better than just indexing the redirect target (which is the wrong behavior in many instances; see comments earlier).

The fix is to index both instances of the page, both the source and the target, making sure that the outlinks from the target page are only assigned to the target page. This way, in the (frequent) case that the redirect source is the canonical version of the page, with more anchor text, it will show up for searches. The fix seems to work pretty well, and solves a significant percentage of Nutch's "missing home pages" problem without using much extra space in the index. If it sounds useful to anyone, I'm happy to contribute it back.
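
The described behaviour could look roughly like the sketch below. The actual patch is not shown in this issue, so this is only an illustration of the idea: `IndexEntry` and `entriesForRedirect` are hypothetical names, and the real change would live in Nutch's indexing code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class IndexBothSketch {
    // Hypothetical index entry: a URL, the page content, and its outlinks.
    static class IndexEntry {
        final String url;
        final String content;
        final List<String> outlinks;

        IndexEntry(String url, String content, List<String> outlinks) {
            this.url = url;
            this.content = content;
            this.outlinks = outlinks;
        }
    }

    // Produces two entries for a redirected page: one under the source URL and
    // one under the target URL. Only the target entry carries the outlinks, so
    // the links are not counted twice.
    static List<IndexEntry> entriesForRedirect(String sourceUrl, String targetUrl,
                                               String content, List<String> outlinks) {
        List<IndexEntry> entries = new ArrayList<>();
        entries.add(new IndexEntry(sourceUrl, content, Collections.<String>emptyList()));
        entries.add(new IndexEntry(targetUrl, content, outlinks));
        return entries;
    }

    public static void main(String[] args) {
        List<IndexEntry> entries = entriesForRedirect(
                "http://www.example.com/",
                "http://www.example.com/home/index.html",
                "Welcome",
                Arrays.asList("http://www.example.com/about"));
        for (IndexEntry e : entries) {
            System.out.println(e.url + " outlinks=" + e.outlinks);
        }
    }
}
```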

Doug


Chris A. Mattmann added a comment - 20/Jan/07 11:44 PM
Doug,

Let's see what you got. I'd be happy to take a look at it.

Cheers,
Chris


Andrzej Bialecki added a comment - 19/Mar/07 11:49 PM
This is partially fixed, so that page status is consistent. LinkDb-related changes will be implemented later.

Andrzej Bialecki added a comment - 03/Feb/09 01:19 PM
Actually, the problem in the issue description is solved now. I'm closing this one, and the remaining functionality should be tracked as an enhancement in a separate issue.