[NUTCH-2033] parse-tika skips valid documents. - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.10
Fix Version/s: 1.21
Component/s: parser
Labels:
- mime-type
- parse-tika
- parser
- tika

External issue URL:
https://github.com/b-cube/nutch/commit/d7c29a59fddb682d8f854ef9a89e548c7e2a02de

Description

If we run:

bin/nutch parsechecker -dumpText http://ngdc.noaa.gov/geoportal/openSearchDescription

we’ll get:

Status: failed(2,0): Can't retrieve Tika parser for mime-type application/opensearchdescription+xml

The same occurs for:

bin/nutch parsechecker -dumpText http://petstore.swagger.io/v2/swagger.json

Both are perfectly valid documents that would parse if they were returned as "application/xml" and "text/plain" respectively.

This happens because parse-tika uses the MIME type to retrieve a suitable parser, and some composite MIME types are not included in its list even though the documents are perfectly valid and parsable. That is before taking into account that servers often return incorrect MIME types for the documents requested.

We created a helper class as a workaround for this issue. The class uses regular expressions to define synonyms for MIME types. In the first case, any MIME type that matches "application/(.*)+xml" is replaced by "application/xml", so parse-tika parses the document just fine.
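
A minimal sketch of how such a synonym helper could look is given below. It is not the class from the linked commit: the class name `MimeTypeSynonyms`, the method `normalize`, and the rule table are hypothetical and exist only for illustration. The sketch also escapes the `+` in the pattern so that it matches a literal plus sign, which is an assumption about the intended behaviour. No rule is included for the swagger.json case, because the description does not say which MIME type that server returns.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not the class from the linked commit. */
final class MimeTypeSynonyms {

  // Ordered rules: the first pattern that matches the whole MIME type wins.
  private static final Map<Pattern, String> RULES = new LinkedHashMap<>();

  static {
    // Assumption: "+" is escaped so it matches a literal plus sign,
    // e.g. "application/opensearchdescription+xml".
    RULES.put(Pattern.compile("application/.+\\+xml"), "application/xml");
  }

  private MimeTypeSynonyms() {
  }

  /**
   * Returns the synonym for the given MIME type, or the lower-cased base
   * type (parameters removed) when no rule matches.
   */
  static String normalize(String mimeType) {
    if (mimeType == null) {
      return null;
    }
    // Drop parameters such as "; charset=UTF-8" and normalize case.
    String base = mimeType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
    for (Map.Entry<Pattern, String> rule : RULES.entrySet()) {
      if (rule.getKey().matcher(base).matches()) {
        return rule.getValue();
      }
    }
    return base;
  }
}
```

Under these assumptions, a caller would pass the server-reported MIME type through `MimeTypeSynonyms.normalize(...)` before asking parse-tika for a parser, so that "application/opensearchdescription+xml" is looked up as "application/xml".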

People

Assignee: Lewis John McGibbney
Reporter: Luis Lopez
Votes: 0
Watchers: 2

Dates

Created: 03/Jun/15 19:14
Updated: 30/Mar/24 17:19