Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Patch Available
Description
The WebGraph program performs URL normalization. Since outlinks are already normalized during parsing, normalization in WebGraph should become optional. The WebGraph program also has no URL filtering mechanism. When a CrawlDatum is removed from the CrawlDB by a URL filter, it should be possible to remove it from the web graph as well.
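The requested behavior can be illustrated with a minimal, language-neutral sketch. This is not Nutch code: `process_url`, `build_edges`, `strip_fragment`, and `not_spam` are hypothetical names invented for this example, and the hostnames are placeholders. It only shows the idea that normalization and filtering are each applied when enabled, and that an edge is dropped when either endpoint is rejected by the filter.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple
from urllib.parse import urldefrag, urlsplit

Edge = Tuple[str, str]  # (source URL, outlink URL)


def process_url(
    url: str,
    normalize: Optional[Callable[[str], str]] = None,
    url_filter: Optional[Callable[[str], bool]] = None,
) -> Optional[str]:
    """Return the (optionally normalized) URL, or None if the filter rejects it."""
    if normalize is not None:
        url = normalize(url)
    if url_filter is not None and not url_filter(url):
        return None
    return url


def build_edges(
    edges: Iterable[Edge],
    normalize: Optional[Callable[[str], str]] = None,
    url_filter: Optional[Callable[[str], bool]] = None,
) -> Iterator[Edge]:
    """Yield graph edges, skipping any edge with a filtered endpoint."""
    for src, dst in edges:
        src_out = process_url(src, normalize, url_filter)
        dst_out = process_url(dst, normalize, url_filter)
        if src_out is None or dst_out is None:
            continue  # one endpoint was filtered out, so drop the edge
        yield (src_out, dst_out)


def strip_fragment(url: str) -> str:
    """Example normalizer (an assumption): remove the #fragment part."""
    return urldefrag(url).url


def not_spam(url: str) -> bool:
    """Example filter (an assumption): reject one placeholder host."""
    return urlsplit(url).hostname != "spam.example"
```

With both options left at `None`, the input edges pass through unchanged, which corresponds to skipping normalization and filtering entirely.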
Attachments
Issue Links
- is duplicated by
  - NUTCH-1144 Filtering optional in WebGraph (Closed)
  - NUTCH-1171 WebGraph to overwrite normalized input keys (Closed)