
Environment: n/a

Issue Links:

Dependants
This issue blocks:
NUTCH-353: pages that serverside forwards will be refetched every time

Incorporates
This issue is part of:
NUTCH-322: Fetcher discards ProtocolStatus, doesn't store redirected pages

Reference
This issue is related to:
NUTCH-371: DeleteDuplicates should remove documents with duplicate URLs

Resolution Date: 28/Dec/06 12:18 AM

Description
[Excerpt from mailing list, sender: Andrzej Bialecki]
When a page is redirected, the original URL is NOT updated, so CrawlDB will never know that a redirect occurred; it won't even know that a fetch occurred... This looks like a bug.
In 0.7 this was recorded in the segment, and it would then affect the Page status during updatedb. It should do so in 0.8, too...
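
To make the reported behaviour concrete, here is a minimal, self-contained sketch of the bookkeeping the report asks for. It is not Nutch code: `RedirectSketch`, `ToyCrawlDb`, `Status` and `recordFetch` are hypothetical names invented for illustration, and the sketch assumes the fetcher follows the redirect immediately and stores the target page.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names are Nutch classes.
class RedirectSketch {
    enum Status { UNFETCHED, FETCHED, REDIRECTED }

    // Toy stand-in for a crawl database: URL -> last known status.
    static final class ToyCrawlDb {
        private final Map<String, Status> entries = new HashMap<>();

        void inject(String url) { entries.putIfAbsent(url, Status.UNFETCHED); }
        Status statusOf(String url) { return entries.get(url); }
        void update(String url, Status status) { entries.put(url, status); }
    }

    // Records the outcome of one fetch; redirectTarget is null when the
    // server answered directly without redirecting.
    static void recordFetch(ToyCrawlDb db, String url, String redirectTarget) {
        if (redirectTarget == null) {
            db.update(url, Status.FETCHED);
            return;
        }
        // The step the report describes as missing: the original URL is
        // updated too, so the database knows a fetch and a redirect happened.
        db.update(url, Status.REDIRECTED);
        // Assumption: the redirect was followed and the target was stored.
        db.update(redirectTarget, Status.FETCHED);
    }

    public static void main(String[] args) {
        ToyCrawlDb db = new ToyCrawlDb();
        db.inject("http://example.com/old");
        recordFetch(db, "http://example.com/old", "http://example.com/new");
        System.out.println(db.statusOf("http://example.com/old"));
        System.out.println(db.statusOf("http://example.com/new"));
    }
}
```

Without the update to the original URL, its entry would stay `UNFETCHED` and it would be scheduled for fetching again on every cycle, which is the symptom the description points at.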