Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.15, 1.16
Description
Nutch is unable to parse certain outlinks in pages.
For example:
When crawling http://d4fdot.com/pbfdot/PBC-North_index.asp, Nutch does not extract the following outlinks:
congress_avenue_lighting_improvements.asp
blue_heron_boulevard_bridge_fender_replacement.asp
indiantown_road_intersection_improvements.asp
When crawling http://www.d4fdot.com/pbfdot/index.asp, however, Nutch extracts congress_avenue_lighting_improvements.asp correctly, even though the anchor element on that page is structured similarly.
The URL filters and normalizers have been pared down so that they do almost nothing. No URLs or outlinks are being filtered out by the current config, yet the error still occurs.
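
To make the expected result concrete, here is a minimal sketch (plain Python, not Nutch code) of which absolute outlinks a parser should produce from relative anchors on the first page. The HTML snippet is a hypothetical stand-in for the page's markup, which is not reproduced in this report, and `AnchorCollector` and `expected_outlinks` are illustrative names only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects href values from <a> tags and resolves them against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> also matches.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.outlinks.append(urljoin(self.base_url, href))


def expected_outlinks(base_url, html):
    """Return the absolute URLs of all anchors found in the given HTML."""
    collector = AnchorCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.outlinks


# Hypothetical markup: the real page's anchors may be written differently.
SAMPLE_HTML = """
<a href="congress_avenue_lighting_improvements.asp">Congress Avenue</a>
<A HREF="blue_heron_boulevard_bridge_fender_replacement.asp">Blue Heron</A>
<a href='indiantown_road_intersection_improvements.asp'>Indiantown Road</a>
"""

links = expected_outlinks(
    "http://d4fdot.com/pbfdot/PBC-North_index.asp", SAMPLE_HTML
)
for link in links:
    print(link)
```

Comparing a list like this against the outlinks Nutch reports for the same page may help narrow down whether the links are lost during parsing or at a later filtering step.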