Nutch

injector

Summary

Description

Takes a flat file of URLs and adds them to the crawldb as pages to be crawled.
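The description above can be illustrated with a minimal sketch. It is not Nutch's implementation: the real injector runs as MapReduce work (see NUTCH-1712 below), whereas this sketch models the crawldb as a plain in-memory dictionary. The function names `parse_seed_line` and `inject` are hypothetical, and the seed-line layout (one URL per line, optional tab-separated `key=value` metadata, `#` for comments) is an assumption about the flat file format. Keeping an existing entry rather than overwriting it is a simplification chosen for this sketch.

```python
def parse_seed_line(line):
    """Parse one seed-file line into (url, metadata); return None for blanks and comments.

    Assumed layout: URL, then optional tab-separated key=value fields.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    url, *fields = line.split("\t")
    metadata = {}
    for field in fields:
        key, sep, value = field.partition("=")
        if sep:  # ignore fields that are not key=value pairs
            metadata[key.strip()] = value.strip()
    return url.strip(), metadata


def inject(seed_lines, crawldb):
    """Add each seed URL to crawldb (a dict standing in for the real store) as unfetched."""
    for line in seed_lines:
        parsed = parse_seed_line(line)
        if parsed is None:
            continue
        url, metadata = parsed
        # Sketch-level choice: a URL already present is left untouched.
        crawldb.setdefault(url, {"status": "unfetched", "metadata": metadata})
    return crawldb
```

A call such as `inject(open("seeds.txt"), {})` would return a dictionary keyed by URL, where `seeds.txt` is a hypothetical seed file.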

Issues: Unresolved

Type         Key         Summary  Due Date
Bug          NUTCH-1472  InvalidRequestException(why:(String didn't validate.) [webpage][f][ts] failed validation)
Improvement  NUTCH-1712  Use MultipleInputs in Injector to make it a single mapreduce job
Bug          NUTCH-1746  OutOfMemoryError in Mappers


Issues: Updated recently

Type         Key         Summary  Updated
New Feature  NUTCH-2038  Naive Bayes classifier based html Parse filter (for filtering outlinks)
Improvement  NUTCH-1712  Use MultipleInputs in Injector to make it a single mapreduce job
Improvement  NUTCH-2046  The crawl script should be able to skip an initial injection.


Versions: Unreleased

Status      Name   Release date
Unreleased  2.4
Unreleased  1.11
Unreleased  2.3.1