  1. Nutch
  2. NUTCH-1928

Indexing filter of documents by the MIME type

    Details

    • Patch Info:
      Patch Available

      Description

      This patch makes it possible to filter indexed documents by the MIME type of the crawled content. It lets you restrict which MIME types are stored in the Solr/Elasticsearch index without restricting the crawling/parsing process, so there is no need to use the URLFilter plugin family. It also addresses one particular corner case: some URLs carry no format suffix to filter on, such as some RSS feeds (http://www.awesomesite.com/feed), and would otherwise end up in your index mixed with all your HTML content.

      A configuration file can be specified via the mimetype.filter.file property in nutch-site.xml. This file uses the same format as the urlfilter-suffix plugin. If no mimetype.filter.file key is found, an allow-all policy is used instead, so all your crawled documents will be indexed.
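
      As a rough illustration of the configuration described above, the property might be set in nutch-site.xml along these lines. The property name comes from this description. The file name mimetype-filter.txt is a hypothetical placeholder, not something fixed by the patch.

      ```xml
      <!-- Sketch only: goes inside the <configuration> element of nutch-site.xml. -->
      <property>
        <name>mimetype.filter.file</name>
        <!-- Hypothetical file name; point this at your own rules file. -->
        <value>mimetype-filter.txt</value>
        <description>Rules file for the MIME type indexing filter.</description>
      </property>
      ```

      Leaving the property out entirely should fall back to the allow-all policy, so every crawled document is indexed.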

        Attachments

        1. mimetype-patch-v3.patch
          18 kB
          Jorge Luis Betancourt Gonzalez
        2. NUTCH-1928v4.patch
          22 kB
          Lewis John McGibbney
        3. NUTCH-1928v5.patch
          23 kB
          Jorge Luis Betancourt Gonzalez
        4. NUTCH-1928v6.patch
          23 kB
          Jorge Luis Betancourt Gonzalez

          Activity

            People

            • Assignee:
              jorgelbg Jorge Luis Betancourt Gonzalez
              Reporter:
              jorgelbg Jorge Luis Betancourt Gonzalez
            • Votes:
              0
              Watchers:
              5

              Dates

              • Created:
                Updated:
                Resolved: