Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: nutchgora, 1.6
Fix Version/s: None
Component/s: None
Labels: None
Patch Info: Patch Available
Description
The default rules of URLNormalizerRegex remove the anchor only up to the first
occurrence of ? or &. The remaining part of the anchor is kept,
which may cause a large, possibly infinite, number of outlinks when the same document
is fetched again and again under different URLs;
see http://www.mail-archive.com/user%40nutch.apache.org/msg05940.html
Parameters in inner-page anchors are common practice on AJAX web sites.
Currently, crawling AJAX content is not supported (NUTCH-1323).
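
The behavior described above can be illustrated with a small, self-contained sketch. It assumes the default anchor rule is roughly equivalent to the pattern #.*?(\?|&|$) with substitution $1; check regex-normalize.xml for the actual rule. The class name, method names, and example URLs are hypothetical, and the stricter pattern is shown only for comparison, not as the attached patch.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class AnchorNormalizationSketch {

    // Assumed approximation of the default rule: drop the anchor only up to
    // the first '?' or '&' (or the end of the URL) and keep what follows.
    static String normalizeLikeDefault(String url) {
        return url.replaceAll("#.*?(\\?|&|$)", "$1");
    }

    // Hypothetical stricter rule for comparison: drop everything from '#' on.
    static String stripWholeFragment(String url) {
        return url.replaceAll("#.*$", "");
    }

    public static void main(String[] args) {
        // Two inner-page anchors with parameters, pointing at the same document.
        String[] outlinks = {
            "http://example.com/page#tab?id=1",
            "http://example.com/page#tab?id=2"
        };

        Set<String> lenient = new LinkedHashSet<>();
        Set<String> strict = new LinkedHashSet<>();
        for (String url : outlinks) {
            lenient.add(normalizeLikeDefault(url));
            strict.add(stripWholeFragment(url));
        }

        // The lenient rule leaves two distinct URLs for one document;
        // the strict rule collapses them into a single URL.
        System.out.println(lenient);
        System.out.println(strict);
    }
}
```

With the lenient rule, every distinct parameter value inside an anchor yields a new URL for the same document, which is how the number of outlinks can grow without bound.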