[NUTCH-766] Tika parser - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.1
Component/s: None
Labels:
None

Patch Info:

Patch Available

Description

Tika handles a lot of different formats under the bonnet and exposes them nicely via SAX events. Described here is a tika-parser plugin which delegates parsing to Tika but can still coexist with the existing parsing plugins, which is useful for formats that Tika handles only partially (or not at all). Some of the elements below have already been discussed on the mailing lists. Note that this is work in progress; your feedback is welcome.

Tika is already used by Nutch for its MimeType implementations. Tika comes as separate jar files (core and parsers); in the work described here we decided to put the libs in 2 different places:
NUTCH_HOME/lib : tika-core.jar
NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
Since the Nutch core uses Tika only for its MimeType functionality, only tika-core needs to be at the main lib level, whereas the tika plugin needs tika-parsers.jar plus all the jars used internally by Tika.

Due to limitations in the way Tika loads its classes, we had to duplicate the TikaConfig class in the tika-plugin. This might be fixed in the future in Tika itself or avoided by refactoring the mimetype part of Nutch using extension points.

Unlike most other parsers, Tika handles more than one mime type, which is why we use "*" as its mimetype value in the plugin descriptor and have modified ParserFactory.java so that it considers the tika parser as potentially suitable for all mime types. In practice this means that the associations between a mime type and a parser plugin defined in parse-plugins.xml are needed only when we want a mime type handled by a parser other than Tika.
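
The selection rule described above can be illustrated with a minimal sketch. This is not the actual ParserFactory code from the patch: the class name, the two maps and the plugin ids ("parse-html", "parse-tika") are assumptions made up for the example, standing in for the plugin descriptors and for parse-plugins.xml.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names and data are hypothetical, not Nutch code.
public class WildcardParserSelection {

    // Stand-in for the plugin descriptors: plugin id -> declared mime type.
    // "*" marks a plugin that is potentially suitable for every mime type.
    static final Map<String, String> DECLARED = new LinkedHashMap<>();

    // Stand-in for parse-plugins.xml: mime type -> explicitly preferred plugins.
    static final Map<String, List<String>> EXPLICIT = new LinkedHashMap<>();

    static {
        DECLARED.put("parse-html", "text/html");
        DECLARED.put("parse-tika", "*");
        EXPLICIT.put("text/html", List.of("parse-html"));
    }

    // Returns candidate plugin ids in order of preference: explicit
    // associations first, then any plugin whose declared type matches
    // exactly or is the "*" wildcard.
    static List<String> candidates(String mimeType) {
        List<String> result =
            new ArrayList<>(EXPLICIT.getOrDefault(mimeType, List.of()));
        for (Map.Entry<String, String> entry : DECLARED.entrySet()) {
            String declared = entry.getValue();
            boolean matches = declared.equals("*") || declared.equals(mimeType);
            if (matches && !result.contains(entry.getKey())) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(candidates("text/html"));       // [parse-html, parse-tika]
        System.out.println(candidates("application/pdf")); // [parse-tika]
    }
}
```

With this ordering, an explicit association is only needed for a mime type that should go to a parser other than the wildcard one; everything else falls through to it.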

The general approach I chose was to convert the SAX events returned by the Tika parsers into DOM objects and reuse the utilities that come with the current HTML parser, i.e. link detection and metatag handling. It also means that we can use the HTMLParseFilters in exactly the same way. The main difference is that HTMLParseFilters are no longer limited to HTML documents, as the XHTML tags returned by Tika can correspond to an original document in a different format. There is some code duplicated from the html-plugin, which will be resolved by either a) getting rid of the html-plugin altogether or b) exporting its jar and making the tika parser depend on it.
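
As a rough illustration of the SAX-to-DOM idea, the sketch below uses only the JDK's JAXP classes: an identity TransformerHandler receives SAX events and writes them into a DOMResult, and the resulting DOM is then walked for links. It is an assumption-laden example, not the code in the patch: the SAX events are fired by hand instead of coming from a Tika parser, and the class and method names (SaxToDomSketch, buildDom, extractLinks) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Illustrative sketch only; not the tika-parser plugin implementation.
public class SaxToDomSketch {

    // Builds a DOM document from SAX events. The events are simulated here;
    // in the plugin they would be emitted by a Tika parser as XHTML.
    static Document buildDom() {
        try {
            SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
            // An identity handler: SAX events in, unchanged tree out.
            TransformerHandler handler = factory.newTransformerHandler();
            DOMResult result = new DOMResult();
            handler.setResult(result);

            handler.startDocument();
            handler.startElement("", "html", "html", new AttributesImpl());
            handler.startElement("", "body", "body", new AttributesImpl());
            AttributesImpl linkAttrs = new AttributesImpl();
            linkAttrs.addAttribute("", "href", "href", "CDATA",
                "http://www.digitalpebble.com");
            handler.startElement("", "a", "a", linkAttrs);
            char[] text = "DigitalPebble".toCharArray();
            handler.characters(text, 0, text.length);
            handler.endElement("", "a", "a");
            handler.endElement("", "body", "body");
            handler.endElement("", "html", "html");
            handler.endDocument();

            return (Document) result.getNode();
        } catch (TransformerConfigurationException | SAXException e) {
            throw new IllegalStateException("Could not build DOM from SAX events", e);
        }
    }

    // A toy version of link detection: collects the href of every <a> element.
    static List<String> extractLinks(Document doc) {
        List<String> links = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element anchor = (Element) anchors.item(i);
            if (anchor.hasAttribute("href")) {
                links.add(anchor.getAttribute("href"));
            }
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(extractLinks(buildDom()));
    }
}
```

Once the content is a DOM tree, DOM-based utilities such as the link detection or the HTMLParseFilters do not need to know which format the original document was in.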

The following libraries are required in the lib/ directory of the tika-parser :

There is a small test suite which needs to be improved. We will need to look at each individual format and check whether it is covered by Tika and, if so, whether to the same extent as the existing parser; the Wiki is probably the right place for this. The language identifier (which is an HTMLParseFilter) seemed to work fine.

Again, your comments are welcome. Please bear in mind that this is just a first step.

Julien
http://www.digitalpebble.com

Attachments

- NUTCH-766.v2, 28/Jan/10 12:39, 93 kB, Julien Nioche
- NUTCH-766-v3.patch, 01/Feb/10 10:31, 92 kB, Julien Nioche
- NutchTikaConfig.java, 11/Feb/10 07:33, 4 kB, Sami Siren
- sample.tar.gz, 28/Jan/10 12:39, 42 kB, Julien Nioche
- TikaParser.java, 11/Feb/10 07:38, 8 kB, Sami Siren

Issue Links

is part of

NUTCH-789 Improvements to Tika parser

Closed

is related to

NUTCH-767 Update Tika to v0.5 for the MimeType detection

Closed

relates to

NUTCH-705 parse-rtf plugin

Closed

People

Assignee: Chris A. Mattmann

Reporter: Julien Nioche

Dates

Created: 18/Nov/09 14:49

Updated: 28/Sep/10 00:48

Resolved: 12/Feb/10 06:52