Details
Improvement
Status: Closed
Major
Resolution: Auto Closed
2.2
None
None
Patch Available
Description
As far as I can see, in Nutch 1.x you could change your URL filters and normalizers during the crawl and then update the db using crawldb -normalize -filter. There does not seem to be a way to achieve the same in Nutch 2.x.
Anyway, I went ahead and tried to implement -normalize and -filter for the Nutch 2.x updatedb command. I have no experience with any of the technologies involved, including Java, so please check the attached code carefully before using it. I'm very interested to hear whether this is the right approach, and any other comments are welcome.
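To make the intended behaviour concrete, here is a minimal, self-contained sketch of what -normalize and -filter would do to each URL during an update. It is not the attached patch and does not use the Nutch classes: `UpdateDbSketch`, `normalize`, `filter` and `update` are hypothetical names, and the two rules (strip the fragment, keep only one host) are stand-ins for whatever normalizers and filters are configured. The one convention it assumes from Nutch is that a filter returns null to reject a URL.

```java
import java.util.ArrayList;
import java.util.List;

public class UpdateDbSketch {

    // Hypothetical normalizer: drops the "#fragment" part of a URL.
    public static String normalize(String url) {
        int hash = url.indexOf('#');
        return hash >= 0 ? url.substring(0, hash) : url;
    }

    // Hypothetical filter: keeps URLs under one host, returns null to reject.
    public static String filter(String url) {
        return url.startsWith("http://example.org/") ? url : null;
    }

    // Applies the optional steps in order: normalize first, then filter.
    // URLs rejected by the filter are dropped from the result.
    public static List<String> update(List<String> urls, boolean doNormalize, boolean doFilter) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            String current = url;
            if (doNormalize) {
                current = normalize(current);
            }
            if (doFilter && current != null) {
                current = filter(current);
            }
            if (current != null) {
                kept.add(current);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> urls = List.of("http://example.org/a#top", "http://other.example/b");
        System.out.println(update(urls, true, true));
    }
}
```

Normalizing before filtering means the filter rules see the canonical form of each URL, so a rule only has to match one spelling of it. With both flags off, the list passes through unchanged.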