[TIKA-2443] Plain text file identified as rfc822 and which can cause StackOverflowError - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.11, 1.16
Fix Version/s: 1.25
Component/s: mime
Labels: None

Description

I have a file called test.txt, containing only:
Date: 06/25/2014 15:54:19
And some more text I am writing. This will
be detected as rfc822

This file is detected and parsed as message/rfc822.
I think the magic rule on "Date: " is too strong; the file should have been detected as text/plain only. It looks to me like the reverse of https://issues.apache.org/jira/browse/TIKA-879
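
To make the complaint concrete, here is a minimal sketch of a stricter check that requires at least two distinct, well-known header fields before treating text as an email message. This is not Tika's detection code and uses no Tika API. The class name `HeaderSniffer`, the header list, and the threshold of two are all assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Tika: counts well-known mail headers
// at the start of a text before deciding that it looks like a message.
final class HeaderSniffer {
    // Illustrative, non-exhaustive list of header names (lower case).
    private static final Set<String> KNOWN = new HashSet<>(Arrays.asList(
            "date", "from", "to", "cc", "subject", "message-id",
            "received", "return-path", "mime-version"));

    // Number of distinct known headers found before the first line
    // that is blank or is not a known header.
    static int countLeadingHeaders(String text) {
        Set<String> seen = new HashSet<>();
        for (String line : text.split("\r?\n")) {
            if (line.isEmpty()) {
                break; // a blank line ends the header block
            }
            if (line.startsWith(" ") || line.startsWith("\t")) {
                continue; // folded continuation of the previous header
            }
            int colon = line.indexOf(':');
            if (colon <= 0) {
                break; // no "Name:" prefix, so this is not a header line
            }
            String name = line.substring(0, colon).toLowerCase(Locale.ROOT);
            if (!KNOWN.contains(name)) {
                break; // unknown field name, stop counting
            }
            seen.add(name);
        }
        return seen.size();
    }

    // Assumed threshold: a lone "Date:" line is not enough.
    static boolean looksLikeRfc822(String text) {
        return countLeadingHeaders(text) >= 2;
    }
}
```

With this sketch, the test.txt content above yields only one known header ("Date"), so `looksLikeRfc822` returns false, while a text that starts with Date:, From: and Subject: lines is accepted.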

We noticed this issue because we have a large log file in which many lines carry a date, log level and message. It is parsed as message/rfc822 (only because it starts with "Date:"), and parsing eventually throws a StackOverflowError.

Is there a workaround, for example through configuration, to make this rule weaker?
We use DefaultParser with all default settings. We are on Tika 1.11, but we also tried Tika 1.16 and saw the same StackOverflowError (probably again because the file was parsed as message/rfc822).
The only workaround that I found was to add a custom-mimetypes.xml like this:

<?xml version="1.0" encoding="UTF-8"?>
<mime-info>
  <mime-type type="text/plain">
    <magic priority="70">
      <match value="Date:" type="string" offset="0"/>
    </magic>
  </mime-type>
</mime-info>

Would you recommend some other workaround to make sure the file does not get parsed as message/rfc822?
I also have another question: can this custom-mimetypes.xml be loaded from an external location?

Many thanks.

People

Assignee: Unassigned
Reporter: Viorica Visan
Votes: 0
Watchers: 5

Dates

Created: 15/Aug/17 15:37
Updated: 21/Jul/21 14:43
Resolved: 21/Jul/21 14:43