[NUTCH-1658] Nutch mangles seed URLs and then reports on the mangled ones - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Not A Problem
Affects Version/s: None
Fix Version/s: 1.7
Component/s: None
Labels:
- newbie
Environment: Ubuntu 12.04

Description

Note: I'm using Nutch to verify that each URI in a long list is good, so I use them all as seeds in a single-iteration crawl.

Some seed URIs are mangled by Nutch, and Nutch then reports on the mangled versions (which are no good) instead of the original ones (which are good). Two patterns have emerged from my tests:

(1) If the query portion of the URI contains '//', it becomes '/', rendering the resource unfetchable. Example:

https://www.pay.gov/paygov/forms/formInstance.html?nc=1356014395287&agencyFormId=44568890&userFormSearch=https%3A//www.pay.gov/paygov/keywordSearchForms.html%3FshowingDetails=true&showingAll=false&sortProperty=agencyFormName&totalResults=1&keyword=apma&ascending=true&pageOffset=0

(2) If the URI has a trailing '.', it disappears, apparently rendering the resource unfetchable. Example:

http://www.irs.gov/Individuals/ITIN-Policy-Change-Summary-for-2013.

Both of the above are known good URIs. When they are used as seeds, Nutch 1.7 doesn't report on them; instead it reports on URIs that have been mangled as described above. In the '//' -> '/' case, Nutch reports that robot access is denied, which is probably true of the mangled URI. In the trailing '.' case, Nutch says there's no such resource, which is also true of the mangled URI, but that's not the question I was trying to get Nutch to answer.
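
For anyone who wants to pre-screen a seed list, the two patterns above can be approximated in a few lines of Python. This is only a sketch that imitates the behavior observed in this report; it is not Nutch code, and the function names `simulate_observed_mangling` and `altered_seeds` are hypothetical. It assumes that only the query portion is affected by the '//' collapse and that only a single trailing '.' is dropped, which is all that the two examples show.

```python
import re


def simulate_observed_mangling(url: str) -> str:
    """Imitate the two rewrites described in this report (an approximation, not Nutch code)."""
    # Pattern (1): collapse runs of '/' in the query portion only,
    # leaving the '//' after the scheme untouched.
    head, sep, query = url.partition("?")
    query = re.sub(r"/{2,}", "/", query)
    rewritten = head + sep + query
    # Pattern (2): drop a single trailing '.'.
    if rewritten.endswith("."):
        rewritten = rewritten[:-1]
    return rewritten


def altered_seeds(seeds):
    """Map each rewritten URL back to its original seed, for seeds that would change."""
    result = {}
    for seed in seeds:
        rewritten = simulate_observed_mangling(seed)
        if rewritten != seed:
            result[rewritten] = seed
    return result


# The two seed URIs from this report.
seeds = [
    "https://www.pay.gov/paygov/forms/formInstance.html?nc=1356014395287"
    "&agencyFormId=44568890&userFormSearch=https%3A//www.pay.gov/paygov/"
    "keywordSearchForms.html%3FshowingDetails=true&showingAll=false"
    "&sortProperty=agencyFormName&totalResults=1&keyword=apma"
    "&ascending=true&pageOffset=0",
    "http://www.irs.gov/Individuals/ITIN-Policy-Change-Summary-for-2013.",
]

for rewritten, original in altered_seeds(seeds).items():
    print(original)
    print("  ->", rewritten)
```

The mapping returned by `altered_seeds` makes it possible to trace a reported URI back to the seed it came from, so that a "robots denied" or "not found" result on a rewritten URI is not mistaken for a verdict on the original.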

People

Assignee: Unassigned

Reporter: Steve Newcomb

Votes: 0

Watchers: 1

Dates

Created: 30/Oct/13 16:57

Updated: 31/Oct/13 12:54

Resolved: 31/Oct/13 12:48