  1. Nutch
  2. NUTCH-1258

MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.5
    • Component/s: indexer
    • Labels: None
    • Patch Info: Patch Available

    Description

      The MoreIndexingFilter reads the Content-Type from parse metadata only. However, that value is often garbage because web developers can set it to anything they like. The filter must also be able to read the Content-Type field from content metadata, because that field contains the type detected by Tika's Detector.
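
      The requested behaviour could be sketched roughly as follows. This is not the attached patch, only a minimal illustration under stated assumptions: plain `Map` objects stand in for Nutch's parse metadata and content metadata, and the class name, method name, and `preferContentMeta` switch are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: picks a Content-Type value
// from two metadata sources, falling back when the preferred one is empty.
class ContentTypeResolver {

    static final String CONTENT_TYPE = "Content-Type";

    // preferContentMeta is an assumed switch: when true, the detected type
    // in content metadata wins over the value found in parse metadata.
    static String resolveContentType(Map<String, String> contentMeta,
                                     Map<String, String> parseMeta,
                                     boolean preferContentMeta) {
        Map<String, String> first = preferContentMeta ? contentMeta : parseMeta;
        Map<String, String> second = preferContentMeta ? parseMeta : contentMeta;

        String value = first.get(CONTENT_TYPE);
        if (value == null || value.trim().isEmpty()) {
            // Preferred source has nothing usable, so try the other one.
            value = second.get(CONTENT_TYPE);
        }
        return value == null ? null : value.trim();
    }

    public static void main(String[] args) {
        Map<String, String> contentMeta = new HashMap<>();
        Map<String, String> parseMeta = new HashMap<>();
        contentMeta.put(CONTENT_TYPE, "application/pdf");        // detected type
        parseMeta.put(CONTENT_TYPE, "text/html; charset=bogus"); // set by the page author

        System.out.println(resolveContentType(contentMeta, parseMeta, true));
        System.out.println(resolveContentType(contentMeta, parseMeta, false));
    }
}
```

      In a real filter the two sources would be the metadata objects Nutch passes to the indexing filter, and the preference would more plausibly come from configuration than from a method argument.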

      Attachments

        1. NUTCH-1258-v2.patch
          2 kB
          Julien Nioche
        2. NUTCH-1258-1.5-1.patch
          2 kB
          Markus Jelsma

            People

              Assignee: Markus Jelsma (markus17)
              Reporter: Markus Jelsma (markus17)
              Votes: 0
              Watchers: 0

              Dates

                Created:
                Updated:
                Resolved: