
Key: NUTCH-411
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Minor
Assignee: Doğacan Güney
Reporter: Doğacan Güney
Votes: 0
Watchers: 0
Nutch

Parse ignores meta refresh redirection

Created: 30/Nov/06 02:35 PM   Updated: 10/Apr/09 12:29 PM
Component/s: None
Affects Version/s: 0.9.0
Fix Version/s: 1.0.0

Time Tracking:
Not Specified

File Attachments:
parse-redirect.patch (text file, 1 kB), attached 2006-11-30 02:53 PM by Doğacan Güney, licensed for inclusion in ASF works
Issue Links:
Reference: NUTCH-572

Resolution Date: 08/Nov/07 03:03 PM


Description
If fetching and parsing are run as separate jobs, then a redirection coming from a meta refresh tag (e.g. <meta http-equiv="refresh" content="0;url=foo/">) is ignored, resulting in the loss of the target URL ("foo/").
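
For illustration only, here is a minimal sketch of how a parse step might extract the redirect target from the content attribute of a meta refresh tag and resolve it against the page URL. This is not the attached patch and not Nutch code: the class name MetaRefresh, the method redirectTarget, and the regular expression are all hypothetical, and the pattern only covers the common "<delay>;url=<target>" form.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
final class MetaRefresh {
    // Matches "<delay>;url=<target>", allowing whitespace and optional quotes.
    private static final Pattern CONTENT = Pattern.compile(
            "^\\s*(\\d+)\\s*[;,]\\s*url\\s*=\\s*['\"]?([^'\"]+?)['\"]?\\s*$",
            Pattern.CASE_INSENSITIVE);

    // Returns the absolute redirect URL, or empty if the content value
    // carries no usable target (for example a plain self-refresh like "5").
    static Optional<String> redirectTarget(String baseUrl, String content) {
        Matcher m = CONTENT.matcher(content);
        if (!m.matches()) {
            return Optional.empty();
        }
        try {
            // Resolve relative targets such as "foo/" against the page URL.
            URI resolved = URI.create(baseUrl).resolve(m.group(2).trim());
            return Optional.of(resolved.toString());
        } catch (IllegalArgumentException e) {
            // The base or target is not a syntactically valid URI.
            return Optional.empty();
        }
    }
}
```

With the tag from the description on a page at http://example.com/a/index.html, redirectTarget returns http://example.com/a/foo/. That resolved URL is the one that would otherwise be lost when parsing runs as its own job.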

Doğacan Güney added a comment - 30/Nov/06 02:52 PM
Here is my not-necessarily-correct patch for this. It adds the redirect target as a newly discovered URL (so it gets initialScore), which differs from what happens when we parse in the fetcher.

I believe that in the long term, Nutch should associate the source URL with the redirected URL. But this patch (or a more correct version of it) can be applied so that we do not lose URLs in the short term.


Doğacan Güney made changes - 30/Nov/06 02:53 PM
Field: Original Value → New Value
Attachment: (none) → parse-redirect.patch [ 12346126 ]
Dennis Kubes made changes - 04/Nov/07 09:43 PM
Link: (none) → This issue relates to NUTCH-572 [ NUTCH-572 ]
Doğacan Güney added a comment - 08/Nov/07 03:03 PM
This is fixed as part of NUTCH-547.

Doğacan Güney made changes - 08/Nov/07 03:03 PM
Resolution: (none) → Fixed [ 1 ]
Fix Version/s: (none) → 1.0.0 [ 12312443 ]
Assignee: (none) → Doğacan Güney [ dogacan ]
Status: Open [ 1 ] → Closed [ 6 ]