Tika / TIKA-3771

Regression from TIKA-3687: Files wrongly detected as EML


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 2.4.1
    • Component/s: None
    • Labels: None

    Description

      While running regression tests as part of upgrading from Tika 1.x to 2.4.0, I found that several hundred samples, out of about 1M files of different types, are now detected as EML. This is caused by the <match value="\nX-" type="string" offset="0:1024"/> rule that TIKA-3687 added to the minShouldMatch="2" clause. Attached is a sample PNG file that triggers it (it also contains a \nDate: value in the first 1024 bytes).
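      To illustrate why a binary file can satisfy such a clause, here is a minimal sketch in plain Java that does not use Tika at all. It assumes a simplified reading of the rule: a pattern counts as matched if it starts anywhere in the first 1024 bytes, and the clause fires once at least minShouldMatch patterns match. The class name, the helper methods, the two-pattern list and the fake PNG bytes are all hypothetical; the real clause contains more patterns than the two mentioned in this report.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

public class MinShouldMatchSketch {

    // True if `pattern` starts at any offset in [from, to] of `data`.
    static boolean matchesAt(byte[] data, byte[] pattern, int from, int to) {
        int last = Math.min(to, data.length - pattern.length);
        for (int i = from; i <= last; i++) {
            int j = 0;
            while (j < pattern.length && data[i + j] == pattern[j]) {
                j++;
            }
            if (j == pattern.length) {
                return true;
            }
        }
        return false;
    }

    // Counts how many of the patterns start within offsets 0..1024.
    static int countMatches(byte[] data, List<String> patterns) {
        int hits = 0;
        for (String p : patterns) {
            if (matchesAt(data, p.getBytes(StandardCharsets.ISO_8859_1), 0, 1024)) {
                hits++;
            }
        }
        return hits;
    }

    // Simplified stand-in for a minShouldMatch clause (an assumption, not Tika code).
    static boolean clauseFires(byte[] data, List<String> patterns, int minShouldMatch) {
        return countMatches(data, patterns) >= minShouldMatch;
    }

    // Hypothetical sample: PNG signature followed by a text chunk that
    // happens to contain header-like lines.
    static byte[] fakePngPrefix() {
        byte[] sig = {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'};
        byte[] text = "tEXtComment\0scan\nDate: 2022-05-20\nX-Resolution: 300\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] out = new byte[sig.length + text.length];
        System.arraycopy(sig, 0, out, 0, sig.length);
        System.arraycopy(text, 0, out, sig.length, text.length);
        return out;
    }

    public static void main(String[] args) {
        // Only these two patterns are named in the report; the rest are omitted.
        List<String> patterns = List.of("\nX-", "\nDate:");
        byte[] png = fakePngPrefix();
        System.out.println("matched patterns: " + countMatches(png, patterns));
        System.out.println("clause fires: " + clauseFires(png, patterns, 2));
    }
}
```

      The point of the sketch is that short patterns such as \nX- are weak evidence on their own, so two of them co-occurring in 1 KB of arbitrary binary data is not that unlikely across a large corpus.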

      On an unrelated note, I tried to override the message/rfc822 mime definition with a custom-tika-mimetypes.xml on the classpath, but it had no effect. This used to work in Tika 1.x. Was that change intentional? I think user definitions should take precedence over Tika's built-in definitions, since they can depend on domain or context (e.g. the same extension may be used by different applications). If it wasn't intentional, I'll open a separate issue.
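      When an override appears to have no effect, a first step is to check whether the custom definitions file is visible on the classpath at all. The following is a rough diagnostic sketch using only the JDK. The default resource path in it is an assumption about where Tika looks for custom definitions and is not taken from this report, which refers to the file as custom-tika-mimetypes.xml; pass the path you actually use as the first argument. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

public class CustomMimeTypesCheck {

    // Assumed default location of custom definitions (an assumption, adjust as needed).
    static final String ASSUMED_RESOURCE = "org/apache/tika/mime/custom-mimetypes.xml";

    // Lists every classpath entry that provides the given resource.
    static List<URL> findOnClasspath(String resource) throws IOException {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = ClassLoader.getSystemClassLoader();
        }
        return Collections.list(cl.getResources(resource));
    }

    public static void main(String[] args) throws IOException {
        String resource = args.length > 0 ? args[0] : ASSUMED_RESOURCE;
        List<URL> found = findOnClasspath(resource);
        if (found.isEmpty()) {
            System.out.println("Not found on classpath: " + resource);
        } else {
            for (URL url : found) {
                System.out.println("Found: " + url);
            }
        }
    }
}
```

      This only shows whether the file can be loaded. It says nothing about which definition wins when both the built-in and the custom file define message/rfc822.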

      Attachments


          People

            Assignee: lfcnassif Luís Filipe Nassif
            Reporter: lfcnassif Luís Filipe Nassif
            Votes: 0
            Watchers: 4
