NUTCH-2147

MetadataScoringFilter for Nutch

Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.10
    • Fix Version/s: None
    • Component/s: plugin, scoring
    • Labels: None

    Description

      This issue originally envisioned a LanguagePreferenceScoringFilter, so that Nutch could easily be turned into a directed crawler driven by the crawl administrator's ranking of the languages we wish to crawl.
      Right now this is not possible.
      We already detect and index language in the language-identifier plugin and, IIRC, in parse-tika as well; however, the detected language currently does not affect the scoring of pages.

      The scope of this issue has since been widened to make it applicable to a broader variety of use cases. The filter will therefore take advantage of NUTCH-1980 by pulling Language entries (amongst other things) from the CrawlDB metadata.
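
      To make the idea concrete, here is a minimal sketch of the language-weighting logic, written as plain Java with no Nutch dependencies. Everything in it is an assumption for illustration only: the class name, the "lang" metadata key, the weight table, and the rule that unknown or missing languages fall back to a default weight. It is not the proposed plugin and does not use the Nutch ScoringFilter API.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/**
 * Hypothetical sketch, not Nutch code: scales a page score by an
 * administrator-assigned weight for the page's detected language.
 */
final class LanguagePreferenceSketch {

    // Assumed metadata key; the key actually stored in the CrawlDB may differ.
    static final String LANGUAGE_KEY = "lang";

    // Language code (lower case) -> multiplier chosen by the crawl administrator.
    private final Map<String, Float> weights;

    // Multiplier used when the language is missing or not listed.
    private final float defaultWeight;

    LanguagePreferenceSketch(Map<String, Float> weights, float defaultWeight) {
        this.weights = new LinkedHashMap<>(weights);
        this.defaultWeight = defaultWeight;
    }

    /** Returns the score multiplied by the weight for the page's language. */
    float adjust(float score, Map<String, String> metadata) {
        String lang = metadata.get(LANGUAGE_KEY);
        if (lang == null) {
            return score * defaultWeight;
        }
        // Normalize so that "DE" and " de " match the configured key "de".
        Float weight = weights.get(lang.trim().toLowerCase(Locale.ROOT));
        return score * (weight != null ? weight : defaultWeight);
    }

    public static void main(String[] args) {
        // Example preferences: favour English, demote German, leave others alone.
        LanguagePreferenceSketch filter =
            new LanguagePreferenceSketch(Map.of("en", 2.0f, "de", 0.5f), 1.0f);
        System.out.println(filter.adjust(1.0f, Map.of(LANGUAGE_KEY, "en")));
        System.out.println(filter.adjust(1.0f, Map.of(LANGUAGE_KEY, "de")));
        System.out.println(filter.adjust(1.0f, Map.of(LANGUAGE_KEY, "fr")));
    }
}
```

      A multiplicative weight keeps the filter composable with whatever score other scoring plugins have already assigned; the same lookup could in principle be keyed on any other metadata field, which is the generalisation this issue is aiming for.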

          People

            Assignee: Lewis John McGibbney (lewismc)
            Reporter: Lewis John McGibbney (lewismc)
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated: