
Key: NUTCH-562
Type: Improvement
Status: Closed
Resolution: Fixed
Priority: Minor
Assignee: Chris A. Mattmann
Reporter: Chris A. Mattmann
Votes: 0
Watchers: 0
Nutch

Port mime type framework to use Tika mime detection framework

Created: 29/Sep/07 04:36 AM   Updated: 09/Oct/07 04:30 AM  Due: 30/Sep/07
Component/s: mime_type_detector
Affects Version/s: 1.0.0
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
  NUTCH-562.Mattmann.patch.txt (text file, 309 kB), attached 2007-10-07 03:32 PM by Chris A. Mattmann, licensed for inclusion in ASF works
  tika-0.1-dev.jar (Java archive, 105 kB), attached 2007-10-07 03:33 PM by Chris A. Mattmann, licensed for inclusion in ASF works
Environment: MacBook Pro, Intel Core Duo 2.0 GHz, 2.0 GB RAM, Mac OS X 10.4 (although the improvement is independent of the environment)

Resolution Date: 09/Oct/07 12:24 AM


Description
With Tika (http://incubator.apache.org/tika/) nearing a stable 0.1 release candidate, I think this is a good time to patch Nutch to use Tika's mime detection system, an improvement over the existing Nutch one (written primarily by Jerome). Tika's mime system is based on the Freedesktop.org mime system and includes several improvements over the existing Nutch mime system, such as:

1. Reliable XML-based content detection (a clear issue that has plagued Nutch for some time), including the ability to distinguish between RSS, Atom, generic XML, etc.
2. Mime magic pattern matching, including support for multiple patterns
3. Glob pattern matching, including support for more than one glob pattern
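
To make the three kinds of detection above concrete, here is a minimal, self-contained Java sketch. It is not Tika's or Nutch's actual API: the class name `MimeSketch`, the method `detect`, the small lookup tables, and the order in which the checks run are all assumptions made for illustration, and the glob and XML handling are deliberately simplified.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of magic, XML-root, and glob detection; not Tika's API. */
final class MimeSketch {
    // Several magic prefixes may map to one type (item 2).
    private static final Map<String, String> MAGIC = new LinkedHashMap<>();
    // XML root element names refine generic XML into RSS, Atom, etc. (item 1).
    private static final Map<String, String> XML_ROOTS = new LinkedHashMap<>();
    // Several glob patterns may map to one type (item 3).
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();

    static {
        MAGIC.put("%PDF-", "application/pdf");
        MAGIC.put("GIF87a", "image/gif");
        MAGIC.put("GIF89a", "image/gif");
        XML_ROOTS.put("rss", "application/rss+xml");
        XML_ROOTS.put("feed", "application/atom+xml");
        GLOBS.put("*.jpg", "image/jpeg");
        GLOBS.put("*.jpeg", "image/jpeg");
        GLOBS.put("*.pdf", "application/pdf");
    }

    /** Returns a guessed type from a resource name and its leading bytes. */
    static String detect(String name, byte[] head) {
        // ISO-8859-1 maps every byte to one char, so byte prefixes survive intact.
        String text = new String(head, StandardCharsets.ISO_8859_1);

        // Magic: compare the start of the content with each known prefix.
        for (Map.Entry<String, String> e : MAGIC.entrySet()) {
            if (text.startsWith(e.getKey())) {
                return e.getValue();
            }
        }
        // XML: look at the root element to tell RSS and Atom from plain XML.
        String root = xmlRoot(text);
        if (root != null) {
            return XML_ROOTS.getOrDefault(root, "application/xml");
        }
        // Glob: only "*.ext" suffix patterns are handled in this sketch.
        String lower = name.toLowerCase();
        for (Map.Entry<String, String> e : GLOBS.entrySet()) {
            if (lower.endsWith(e.getKey().substring(1))) {
                return e.getValue();
            }
        }
        return "application/octet-stream";
    }

    /**
     * Returns the lower-cased local name of the first element, or null if the
     * content does not start with an XML declaration. Simplification: element
     * names inside comments are not skipped.
     */
    static String xmlRoot(String text) {
        String t = text.trim();
        if (!t.startsWith("<?xml")) {
            return null;
        }
        int i = 0;
        while ((i = t.indexOf('<', i)) >= 0) {
            if (i + 1 < t.length() && Character.isLetter(t.charAt(i + 1))) {
                int end = i + 1;
                while (end < t.length()
                        && (Character.isLetterOrDigit(t.charAt(end))
                            || ":_-.".indexOf(t.charAt(end)) >= 0)) {
                    end++;
                }
                String qname = t.substring(i + 1, end);
                // Drop any namespace prefix such as "atom:".
                return qname.substring(qname.indexOf(':') + 1).toLowerCase();
            }
            i++;
        }
        return null;
    }
}
```

Under these assumptions, calling `MimeSketch.detect("news.xml", bytes)` on content that begins with an XML declaration followed by an `<rss>` root yields `application/rss+xml`, while the same file name with a `<feed>` root yields `application/atom+xml`. A real implementation would load its patterns from a mime-types definition file rather than hard-coding them.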

I'll get together a patch and then attach it to the list once it's relatively stable.


