[TIKA-688] Enhance content-type detector to recognize almost plain text - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 0.9
Fix Version/s: 0.10
Component/s: mime
Labels: None

Description

I am using Tika to convert a collection of documents that includes files named something.txt. I call Tika#parse(InputStream), which auto-detects the content type and hands back a reader. The files are almost plain text: each has a scattering of control characters in it. On these files the reader given to me by Tika#parse() immediately returns null.

After some experimentation I found that a single Control-K character (0x0B, vertical tab) early in the file causes the MIME type detector to give up and label the file application/octet-stream.

Please consider adding a recognizer for such files; it would be great if Tika could clean them up by dropping the stray control characters. I note that if I drop such a file into the Tika GUI, or invoke Tika on the command line, it does well, and I think this behavior comes from using the file name as a hint. I should probably be using a different Tika method and will try to figure that out next. Thanks for listening.
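
To make the request concrete, here is a minimal sketch of the kind of recognizer and cleanup described above, written against the Java standard library only. It is not Tika code and not necessarily how the fix was implemented: the class name AlmostPlainText, the 2% threshold, the rule that any NUL byte means binary, and both method names are assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: decides whether a byte prefix is
// "almost plain text" and strips stray control characters from decoded text.
final class AlmostPlainText {

    // Assumed tolerance: at most 2% of the sampled bytes may be stray controls.
    static final double MAX_CONTROL_RATIO = 0.02;

    // A "stray" control is any C0 control other than tab, LF, or CR, plus DEL.
    static boolean isStrayControl(int c) {
        return (c < 0x20 && c != '\t' && c != '\n' && c != '\r') || c == 0x7F;
    }

    // Returns true when the prefix contains no NUL byte and only a small
    // share of stray control bytes. An empty prefix has none, so it passes.
    static boolean looksLikeText(byte[] prefix) {
        int stray = 0;
        for (byte raw : prefix) {
            int b = raw & 0xFF;
            if (b == 0) {
                return false; // assumption: NUL strongly suggests binary data
            }
            if (isStrayControl(b)) {
                stray++;
            }
        }
        return stray <= prefix.length * MAX_CONTROL_RATIO;
    }

    // Drops stray control characters, keeping tabs and line breaks.
    static String stripControlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!isStrayControl(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Build a sample with one Control-K (0x0B) near the start.
        StringBuilder sb = new StringBuilder("Dear");
        sb.append((char) 0x0B);
        for (int i = 0; i < 40; i++) {
            sb.append(" reader,");
        }
        String sample = sb.toString();
        byte[] bytes = sample.getBytes(StandardCharsets.US_ASCII);

        System.out.println("looks like text: " + looksLikeText(bytes));
        System.out.println("cleaned length:  " + stripControlChars(sample).length());
    }
}
```

A ratio test is used rather than an all-or-nothing check so that one Control-K in an otherwise ordinary file does not flip the result to application/octet-stream; the exact cutoff would need tuning against real documents.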

People

Assignee: Jukka Zitting

Reporter: Chris Lott

Votes: 0

Watchers: 0

Dates

Created: 09/Aug/11 18:56

Updated: 20/Oct/11 12:34

Resolved: 17/Sep/11 11:51