[NUTCH-2147] MetadataScoringFilter for Nutch - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.10
Fix Version/s: None
Component/s: plugin, scoring
Labels: None

Description

This issue started as a proposal for a LanguagePreferenceScoringFilter, so that Nutch could easily be turned into a directed crawler driven by the crawl administrator's ranking of the languages we wish to crawl.
Right now this is not possible.
We already detect and index language within the language-identifier plugin as well as within parse-tika IIRC; however, the presence of a language does not currently affect the scoring of pages.

The scope of this issue has since been broadened to make it applicable to a wider variety of use cases. The filter will therefore take advantage of ~~NUTCH-1980~~ by pulling Language entries (amongst other things) from the CrawlDB metadata.
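
To make the idea concrete, here is a minimal sketch of the scoring logic only, written in plain Java with no Nutch dependencies. It is not the proposed plugin and does not use the Nutch ScoringFilter API: the class name `MetadataScoringSketch`, the `lang` metadata key, and the boost values are all hypothetical, chosen purely for illustration. It assumes that a page's metadata can be read as string key/value pairs and that the administrator supplies a multiplier per preferred value.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: multiplies a page score by an administrator-defined
// boost looked up from one metadata field. Not part of Nutch.
class MetadataScoringSketch {
    private final String metadataKey;         // e.g. "lang" (assumed key name)
    private final Map<String, Float> boosts;  // metadata value -> multiplier
    private final float defaultBoost;         // used for values with no entry

    MetadataScoringSketch(String metadataKey, Map<String, Float> boosts, float defaultBoost) {
        this.metadataKey = metadataKey;
        this.boosts = new HashMap<>(boosts);
        this.defaultBoost = defaultBoost;
    }

    // Returns the adjusted score; pages without the metadata field keep their score.
    float adjust(float score, Map<String, String> metadata) {
        String value = metadata.get(metadataKey);
        if (value == null) {
            return score;
        }
        Float boost = boosts.get(value.trim().toLowerCase(Locale.ROOT));
        return score * (boost != null ? boost : defaultBoost);
    }

    public static void main(String[] args) {
        Map<String, Float> prefs = new HashMap<>();
        prefs.put("en", 2.0f);  // prefer English pages (illustrative value)
        prefs.put("fr", 0.5f);  // de-prioritise French pages (illustrative value)
        MetadataScoringSketch filter = new MetadataScoringSketch("lang", prefs, 1.0f);

        Map<String, String> page = new HashMap<>();
        page.put("lang", "en");
        System.out.println(filter.adjust(1.0f, page));
    }
}
```

Because the lookup is keyed on an arbitrary metadata field rather than on language specifically, the same structure would cover the broader use cases this issue now targets; how the boosts would actually be configured and wired into the crawl cycle is left open here.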

People

Assignee: Lewis John McGibbney

Reporter: Lewis John McGibbney

Votes: 0

Watchers: 2

Dates

Created: 20/Oct/15 23:25

Updated: 12/Jun/18 19:51