Description
This issue started as a proposal to implement a LanguagePreferenceScoringFilter, so that Nutch could easily be turned into a directed crawler driven by the crawl administrator's ranking of the languages we wish to crawl.
Right now this is not possible.
We already detect and index language within the language-identifier plugin as well as within parse-tika (IIRC); however, the presence of a language currently does not affect the scoring of pages.
The scope of this issue has since been broadened to make it applicable to a wider variety of use cases. The implementation will take advantage of NUTCH-1980 by pulling (among other things) Language entries from the CrawlDB Metadata.
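
To make the idea concrete, here is a minimal sketch of the scoring logic only, written in plain Java with no Nutch classes. Everything in it is an assumption made for illustration: the class name LanguagePreferenceSketch, the metadata key "lang", the use of per-language multipliers, and the default weight are all hypothetical. It is not the plugin's actual implementation, and it does not use Nutch's ScoringFilter interface.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: adjusts a page score using a language value
// found in a metadata map. All names here are hypothetical.
class LanguagePreferenceSketch {

    // Hypothetical metadata key; the key actually stored in the CrawlDB may differ.
    static final String LANG_KEY = "lang";

    // Administrator preferences: lower-case language code -> score multiplier.
    private final Map<String, Float> weights = new HashMap<>();

    // Multiplier applied when the language is missing or has no configured weight.
    private final float defaultWeight;

    LanguagePreferenceSketch(Map<String, Float> preferences, float defaultWeight) {
        for (Map.Entry<String, Float> e : preferences.entrySet()) {
            weights.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
        this.defaultWeight = defaultWeight;
    }

    // Returns the score multiplied by the weight of the page's language.
    float adjust(Map<String, String> metadata, float score) {
        String lang = metadata.get(LANG_KEY);
        Float weight = (lang == null) ? null : weights.get(lang.toLowerCase(Locale.ROOT));
        return score * (weight != null ? weight : defaultWeight);
    }

    public static void main(String[] args) {
        Map<String, Float> prefs = new HashMap<>();
        prefs.put("en", 2.0f);  // prefer English pages
        prefs.put("de", 0.5f);  // de-prioritise German pages
        LanguagePreferenceSketch sketch = new LanguagePreferenceSketch(prefs, 1.0f);

        Map<String, String> page = new HashMap<>();
        page.put(LANG_KEY, "en");
        System.out.println(sketch.adjust(page, 1.0f));
    }
}
```

Multipliers were chosen here only because they leave pages with an unknown language untouched when the default weight is 1.0; an additive boost or a hard cut-off would be equally plausible designs.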