Details
- Improvement
- Status: Closed
- Minor
- Resolution: Fixed
- 1.0.0
- None
- None
- None
- MacBook Pro, Intel Core Duo 2.0 GHz, 2.0 GB RAM, Mac OS X 10.4, although the improvement is independent of the environment
Description
With Tika (http://incubator.apache.org/tika/) nearing a stable 0.1 release candidate, I think it would be a good time to patch Nutch to use Tika's MIME detection system (an improvement over the existing Nutch one written primarily by Jerome). Tika's MIME system is based on the MIME system from Freedesktop.org and includes several improvements over the existing Nutch MIME system, such as:
1. reliable XML-based content detection (a clear issue plaguing Nutch for some time now), including the ability to distinguish between RSS, Atom, generic XML, etc.
2. MIME magic pattern matching, including support for multiple patterns
3. glob pattern matching (support for more than one glob pattern)
I'll put together a patch and then attach it to the list once it's relatively stable.
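To make the three detection strategies above concrete, here is a minimal, self-contained sketch of how magic bytes, XML root-element sniffing, and glob matching might be layered. It is not Tika's API or the eventual patch: the class name `MimeDetectSketch`, the `detect` method, the lookup tables, and the order of the checks are all assumptions made for illustration, and only simple `*suffix` globs are handled.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the Tika or Nutch implementation.
public class MimeDetectSketch {

    // Leading-byte patterns. Several patterns may map to the same type.
    private static final Map<String, String> MAGIC = new LinkedHashMap<>();
    // Filename globs of the form "*suffix". More than one glob per type is allowed.
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();
    // XML root element local name -> more specific type.
    private static final Map<String, String> XML_ROOTS = new LinkedHashMap<>();

    static {
        MAGIC.put("%PDF-", "application/pdf");
        MAGIC.put("GIF87a", "image/gif");
        MAGIC.put("GIF89a", "image/gif");
        GLOBS.put("*.htm", "text/html");
        GLOBS.put("*.html", "text/html");
        GLOBS.put("*.pdf", "application/pdf");
        XML_ROOTS.put("rss", "application/rss+xml");
        XML_ROOTS.put("feed", "application/atom+xml");
    }

    // First start tag whose name begins with a letter or underscore, so the
    // "<?xml ...?>" declaration is skipped; an optional namespace prefix is
    // ignored. A real detector would also need to skip tags inside comments.
    private static final Pattern ROOT =
        Pattern.compile("<(?:[A-Za-z_][\\w.-]*:)?([A-Za-z_][\\w.-]*)");

    public static String detect(String name, byte[] head) {
        // ISO-8859-1 maps every byte to one char, so ASCII magic compares safely.
        String text = new String(head, StandardCharsets.ISO_8859_1);

        // 1. Magic patterns on the leading bytes.
        for (Map.Entry<String, String> e : MAGIC.entrySet()) {
            if (text.startsWith(e.getKey())) {
                return e.getValue();
            }
        }

        // 2. XML content: refine by root element, else fall back to generic XML.
        String trimmed = text.trim();
        if (trimmed.startsWith("<?xml")) {
            Matcher m = ROOT.matcher(trimmed);
            if (m.find()) {
                String type = XML_ROOTS.get(m.group(1));
                if (type != null) {
                    return type;
                }
            }
            return "application/xml";
        }

        // 3. Glob patterns on the file name (case-insensitive suffix match).
        if (name != null) {
            String lower = name.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> e : GLOBS.entrySet()) {
                if (lower.endsWith(e.getKey().substring(1))) {
                    return e.getValue();
                }
            }
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] rss = "<?xml version=\"1.0\"?><rss version=\"2.0\">"
            .getBytes(StandardCharsets.UTF_8);
        System.out.println(detect("feed.xml", rss));
        System.out.println(detect("index.HTML", new byte[0]));
    }
}
```

In this sketch the content-based checks run before the glob check, so a feed served under a generic `.xml` name is still typed by its root element rather than by its extension.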
Attachments
Issue Links
- relates to NUTCH-185 XMLParser is configurable xml parser plugin. (Closed)