Nutch / NUTCH-562

Port mime type framework to use Tika mime detection framework

Details

    • Improvement
    • Status: Closed
    • Minor
    • Resolution: Fixed
    • 1.0.0
    • None
    • None
    • None
    • MacBook Pro, Intel Core Duo 2.0 GHz, 2.0 GB RAM, Mac OS X 10.4 (although the improvement is independent of the environment)

    Description

      With Tika (http://incubator.apache.org/tika/) nearing a stable 0.1 release candidate, I think this is a good time to patch Nutch to use Tika's MIME detection system, which improves on the existing Nutch one (written primarily by Jerome). Tika's MIME system is based on the one from Freedesktop.org and includes several improvements over the existing Nutch MIME system, such as:

      1. reliable XML-based content detection (a clear issue that has plagued Nutch for some time now), with the ability to distinguish between RSS, Atom, generic XML, etc.
      2. MIME magic pattern matching, including support for multiple magic patterns per type
      3. glob pattern matching, including support for more than one glob pattern per type

      I'll get together a patch and then attach it to the list once it's relatively stable.
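
      The three detection strategies listed above can be illustrated with a small sketch. This is not Tika's API or the contents of the patch: the registry layout, the `detect` function, and the order in which the strategies are tried are all assumptions made for illustration, and the sketch is written in Python with only the standard library for brevity.

```python
import fnmatch
import re

# Hypothetical registry (not Tika's format): each entry pairs a MIME type with
# zero or more glob patterns, zero or more magic-byte prefixes, and, for
# XML-based types, the expected root element name.
MIME_TYPES = [
    {"type": "application/rss+xml", "globs": ["*.rss"], "magic": [], "root": "rss"},
    {"type": "application/atom+xml", "globs": ["*.atom"], "magic": [], "root": "feed"},
    {"type": "application/pdf", "globs": ["*.pdf"], "magic": [b"%PDF-"], "root": None},
    {"type": "image/jpeg", "globs": ["*.jpg", "*.jpeg", "*.jpe"],
     "magic": [b"\xff\xd8\xff"], "root": None},
    {"type": "application/xml", "globs": ["*.xml"], "magic": [b"<?xml"], "root": None},
]

# First tag that is not a declaration (<?...) or a comment/doctype (<!...);
# an optional namespace prefix is skipped and the local name is captured.
ROOT_ELEMENT = re.compile(rb"<(?![?!])(?:[\w.-]+:)?([\w.-]+)")


def detect(name, data):
    """Guess a MIME type from a file name and the leading bytes of its content."""
    head = data[:1024]

    # 1. XML root element: tells RSS and Atom apart from generic XML.
    text = head.lstrip()
    if text.startswith(b"<"):
        match = ROOT_ELEMENT.search(text)
        if match:
            root = match.group(1).decode("ascii")
            for entry in MIME_TYPES:
                if entry["root"] == root:
                    return entry["type"]

    # 2. Magic patterns: a type may declare several byte prefixes.
    for entry in MIME_TYPES:
        if any(head.startswith(prefix) for prefix in entry["magic"]):
            return entry["type"]

    # 3. Glob patterns: a type may declare more than one glob.
    lowered = name.lower()
    for entry in MIME_TYPES:
        if any(fnmatch.fnmatchcase(lowered, glob) for glob in entry["globs"]):
            return entry["type"]

    return "application/octet-stream"
```

      In this sketch a feed served as `feed.xml` is still reported as RSS, because the root element is checked before the file name; a real detector would also need offsets, masks, and priorities for magic patterns, which are omitted here.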

      Attachments

        1. NUTCH-562.Mattmann.patch.txt
          309 kB
          Chris A. Mattmann
        2. tika-0.1-dev.jar
          105 kB
          Chris A. Mattmann

            People

              chrismattmann Chris A. Mattmann
              chrismattmann Chris A. Mattmann