
Key: NUTCH-562
Type: Improvement
Status: Closed
Resolution: Fixed
Priority: Minor
Assignee: Chris A. Mattmann
Reporter: Chris A. Mattmann
Votes: 0
Watchers: 0

Nutch

Port mime type framework to use Tika mime detection framework

Created: 29/Sep/07 04:36 AM   Updated: 09/Oct/07 04:30 AM  Due: 30/Sep/07
Component/s: mime_type_detector
Affects Version/s: 1.0.0
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
  NUTCH-562.Mattmann.patch.txt (Text File, 309 kB), attached 2007-10-07 03:32 PM by Chris A. Mattmann, licensed for inclusion in ASF works
  tika-0.1-dev.jar (Java Archive File, 105 kB), attached 2007-10-07 03:33 PM by Chris A. Mattmann, licensed for inclusion in ASF works
Environment: MacBook Pro, Intel Core Duo 2.0 GHz, 2.0 GB RAM, Mac OS X 10.4, although the improvement is independent of the environment
Issue Links:
Reference
 

Resolution Date: 09/Oct/07 12:24 AM


Description
With Tika (http://incubator.apache.org/tika/) nearing a stable 0.1 release candidate, I think it would be a good time to patch Nutch to use Tika's mime detection system (an improvement over the existing Nutch one written primarily by Jerome). Tika's mime system is based on the mime system from Freedesktop.org and includes several improvements over the existing Nutch mime system such as:

1. reliable detection of XML-based content (a clear issue that has plagued Nutch for some time), including the ability to distinguish between RSS, Atom, generic XML, etc.
2. mime magic pattern matching, including support for multiple patterns
3. glob pattern matching, with support for more than one glob pattern per type
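
To make the three points above concrete, here is a minimal, self-contained Java sketch of how root-element sniffing, magic prefixes and glob suffixes can be combined. It does not use Tika's API: the class name, the tables and the lookup order (XML root first, then magic, then glob) are all assumptions made purely for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: none of these names or tables come from Tika or Nutch.
final class MimeDetectionSketch {
    static final String DEFAULT_TYPE = "application/octet-stream";

    // Several magic prefixes may map to the same type.
    private static final Map<String, String> MAGIC = new LinkedHashMap<>();
    // Several simple "*.ext" globs (stored as suffixes) may map to the same type.
    private static final Map<String, String> GLOB_SUFFIXES = new LinkedHashMap<>();
    // Root element local name -> XML-based type.
    private static final Map<String, String> XML_ROOTS = new LinkedHashMap<>();
    // Rough root finder: first "<name" or "<prefix:name", skipping "<?..." and "<!...".
    private static final Pattern FIRST_ELEMENT =
        Pattern.compile("<([A-Za-z_][\\w.-]*:)?([A-Za-z_][\\w.-]*)");

    static {
        MAGIC.put("%PDF-", "application/pdf");
        MAGIC.put("<html", "text/html");
        MAGIC.put("<!DOCTYPE html", "text/html");
        GLOB_SUFFIXES.put(".html", "text/html");
        GLOB_SUFFIXES.put(".htm", "text/html");
        GLOB_SUFFIXES.put(".pdf", "application/pdf");
        XML_ROOTS.put("rss", "application/rss+xml");
        XML_ROOTS.put("feed", "application/atom+xml");
    }

    static String detect(String fileName, byte[] content) {
        // Only the first bytes are inspected; ISO-8859-1 maps bytes 1:1 to chars.
        int n = Math.min(content.length, 1024);
        String head = new String(content, 0, n, StandardCharsets.ISO_8859_1).trim();

        // 1. XML content: look at the root element to tell RSS, Atom and plain XML apart.
        if (head.startsWith("<?xml")) {
            Matcher m = FIRST_ELEMENT.matcher(head);
            String root = m.find() ? m.group(2) : "";
            return XML_ROOTS.getOrDefault(root, "application/xml");
        }
        // 2. Magic: any of the registered prefixes may match.
        for (Map.Entry<String, String> e : MAGIC.entrySet()) {
            if (head.startsWith(e.getKey())) {
                return e.getValue();
            }
        }
        // 3. Glob: any of the registered file name suffixes may match.
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> e : GLOB_SUFFIXES.entrySet()) {
            if (lower.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return DEFAULT_TYPE;
    }

    public static void main(String[] args) {
        byte[] rss = "<?xml version=\"1.0\"?><rss version=\"2.0\"></rss>"
            .getBytes(StandardCharsets.UTF_8);
        System.out.println(detect("news", rss));
    }
}
```

A real mime repository would load these tables from an XML definition file rather than hard-coding them; the hard-coded maps here only keep the sketch short.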

I'll get together a patch and then attach it to the list once it's relatively stable.



Chris A. Mattmann added a comment - 29/Sep/07 04:30 PM
This will allow an XML parser to properly be called because the appropriate mime type is detected.

Chris A. Mattmann added a comment - 07/Oct/07 03:32 PM - edited
Initial patch for comments:

1. This patch removes the MimeType system, and its associated java src files, config files and unit tests from Nutch. This functionality now lives in Tika, and the removed code is replaced by its Tika counterparts.
2. This patch uses the unreleased 0.1-dev version of Tika. When 0.1 is officially released, we can convert to that, though I don't anticipate any MimeType API changes between now and then.
3. All unit tests for core and plugins pass; however, it's probably a good idea to run at least a small crawl with this patch and see if everything works fine. I don't really have the time for this now, so anyone want to try? (cough cough Dogacan cough cough)
4. It's worth noting that this MimeType system from Tika changes the traditional Nutch mime type system (IMO for the better) in a couple of ways. First, whereas the old MimeType system was very happy to return null in places where it couldn't figure out the MimeType, this system tries to return a "default" MimeType (which in this case is "application/octet-stream") if it can't guess the mime type from those that it knows about. Second, this mime type system uses a different type of XML repo file – based on the one available from freedesktop.org's shared MIME package.
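
The first change in point 4 matters for callers, so a small sketch may help. The Java below contrasts a detector that returns null with one that falls back to "application/octet-stream". All class and method names and the extension table are hypothetical; neither method is the Nutch or Tika API, and the `isUnknown` helper is only one assumed way for calling code to treat both conventions alike.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative only: stands in for the old and new lookup behaviour described above.
final class DefaultTypeSketch {
    static final String DEFAULT_TYPE = "application/octet-stream";

    // Hypothetical extension table standing in for a real mime repository.
    private static final Map<String, String> BY_EXTENSION = new HashMap<>();
    static {
        BY_EXTENSION.put("html", "text/html");
        BY_EXTENSION.put("xml", "application/xml");
    }

    // Old style as described: return null when the type cannot be determined.
    static String legacyGuess(String fileName) {
        return BY_EXTENSION.get(extensionOf(fileName));
    }

    // New style as described: never null, fall back to the default type.
    static String guessWithDefault(String fileName) {
        return BY_EXTENSION.getOrDefault(extensionOf(fileName), DEFAULT_TYPE);
    }

    // Callers that used to test for null may also need to recognise the default.
    static boolean isUnknown(String mimeType) {
        return mimeType == null || DEFAULT_TYPE.equals(mimeType);
    }

    private static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(legacyGuess("data.bin"));
        System.out.println(guessWithDefault("data.bin"));
    }
}
```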

Okay, so if someone gets a chance please run a small crawl with this in the next few days and let us know how it works. Otherwise, I'll do the same myself in a couple days and if there are no objections, I'd like to commit this then.


Chris A. Mattmann added a comment - 07/Oct/07 03:33 PM
Tika 0.1 unreleased jar file – drop this in $NUTCH_SRC_HOME/lib

Chris A. Mattmann added a comment - 09/Oct/07 12:24 AM
  • Applied patch, with minor changes to use static version of MimeUtils Tika interface, and to only instantiate once per object family
  • Tested on small crawl of apache.org sites, mime type set appropriately
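
For readers wondering what "instantiate once" can look like in practice, the following Java sketch shows one common pattern: a statically held, lazily created shared instance, so that an expensive object such as a parsed mime type repository is built a single time. This is only one reading of the comment above; the class names are hypothetical, and the committed change in r583016 may well be structured differently.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: not the code committed to Nutch and not the Tika API.
final class SharedMimeRepositorySketch {
    // Counts constructions so the "built once" property is observable.
    static final AtomicInteger LOADS = new AtomicInteger();

    // Hypothetical expensive object, e.g. a parsed mime type definition file.
    static final class Repository {
        Repository() {
            LOADS.incrementAndGet();
        }

        String typeFor(String fileName) {
            return fileName.endsWith(".html") ? "text/html" : "application/octet-stream";
        }
    }

    // Initialization-on-demand holder: the JVM initializes Holder once, on first use.
    private static final class Holder {
        static final Repository INSTANCE = new Repository();
    }

    static Repository shared() {
        return Holder.INSTANCE;
    }

    public static void main(String[] args) {
        System.out.println(shared().typeFor("index.html"));
        System.out.println(shared() == shared());
    }
}
```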

Chris A. Mattmann added a comment - 09/Oct/07 12:24 AM
  • Patch applied to trunk in r583016

Hudson added a comment - 09/Oct/07 04:30 AM