Description
Note: I'm using Nutch to verify that each URI in a long list is good, so I use them all as seeds in a single-iteration crawl.
Some seed URIs are mangled by Nutch, and Nutch then reports on the mangled versions (which are no good) instead of the original ones (which are good). Two patterns have emerged from my tests:
(1) If the query portion of the URI contains '//', it becomes '/', rendering the resource unfetchable. Example:
(2) If the URI has a trailing '.', it disappears, apparently rendering the resource unfetchable. Example:
http://www.irs.gov/Individuals/ITIN-Policy-Change-Summary-for-2013.
Both of the above are known good URIs. When they are used as seeds, Nutch 1.7 doesn't report on them, but instead reports on URIs that have been mangled as described above. In the '//' -> '/' case, Nutch reports that robot access is denied, which is probably true. In the trailing '.' case, Nutch says there's no such resource, which is true, but it's not the question I was trying to get Nutch to answer.
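To make the two patterns concrete, here is a minimal Python sketch that mimics the rewrites described above and lists which seeds would be altered. It is not Nutch's code and makes no claim about which Nutch component performs the rewriting. The function names are hypothetical, and the second seed is a made-up example.com URI standing in for the '//' case.

```python
import re


def collapse_double_slashes(uri: str) -> str:
    """Collapse runs of '/' into one, leaving the '://' after the scheme alone."""
    scheme, sep, rest = uri.partition("://")
    if not sep:
        # No scheme separator found: collapse across the whole string.
        return re.sub(r"/{2,}", "/", uri)
    return scheme + sep + re.sub(r"/{2,}", "/", rest)


def strip_trailing_dot(uri: str) -> str:
    """Drop a single trailing '.', as observed in pattern (2)."""
    return uri[:-1] if uri.endswith(".") else uri


def mangle(uri: str) -> str:
    """Apply both observed rewrites (an imitation, not Nutch's implementation)."""
    return strip_trailing_dot(collapse_double_slashes(uri))


seeds = [
    "http://www.irs.gov/Individuals/ITIN-Policy-Change-Summary-for-2013.",
    "http://example.com/search?next=//path/page",  # hypothetical '//' case
]

# Seeds whose rewritten form differs from the original are the ones at risk.
changed = [(s, mangle(s)) for s in seeds if mangle(s) != s]
for original, mangled in changed:
    print(original, "->", mangled)
```

Running a seed list through a check like this before injecting it would flag the URIs whose crawl results may describe a different resource than the one that was asked about.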