Tika / TIKA-879

Detection problem: message/rfc822 file is detected as text/plain.


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.0, 1.1, 1.2
    • Fix Version/s: 1.18, 2.0.0
    • Component/s: metadata, mime
    • Environment: linux 3.2.9; oracle jdk7, openjdk7, sun jdk6

    Description

      When using DefaultDetector, the MIME type detected for .eml files differs from the expected message/rfc822 (you can test this with testRFC822 and testRFC822_base64 in tika-parsers/src/test/resources/test-documents/).

      The main reason for this behavior is that only the magic detector really works for such files, even if you set CONTENT_TYPE in the metadata or an .eml file name in RESOURCE_NAME_KEY.

      As far as I can tell, MediaTypeRegistry.isSpecializationOf("message/rfc822", "text/plain") returns false, so detection by MimeTypes.detect(...) works only by magic.
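
      The reasoning above can be illustrated with a small self-contained model. This is only a sketch and not Tika's actual code: HintModel, addSuperType and applyHint are hypothetical names, and the rule that a name or content-type hint is accepted only when it equals or specializes the magic result is an assumption about how the hints are combined. The parent link added in main is there purely to show the effect on the outcome, not as a proposed fix.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of hint handling in MIME detection (hypothetical, not Tika code). */
public class HintModel {
    // child type -> declared parent type; assumed to contain no cycles
    private final Map<String, String> parents = new HashMap<>();

    /** Declares 'parent' as the direct supertype of 'type'. */
    public void addSuperType(String type, String parent) {
        parents.put(type, parent);
    }

    /** True if 'type' is a proper descendant of 'ancestor' in this registry. */
    public boolean isSpecializationOf(String type, String ancestor) {
        String current = parents.get(type);
        while (current != null) {
            if (current.equals(ancestor)) {
                return true;
            }
            current = parents.get(current);
        }
        return false;
    }

    /**
     * Combines a magic result with a hint (file name or declared content type).
     * The hint wins only if there is no magic result, or if the hint equals or
     * specializes the magic result; otherwise the magic result is kept.
     */
    public String applyHint(String magicType, String hint) {
        if (magicType == null) {
            return hint;
        }
        if (hint == null) {
            return magicType;
        }
        if (hint.equals(magicType) || isSpecializationOf(hint, magicType)) {
            return hint;
        }
        return magicType;
    }

    public static void main(String[] args) {
        HintModel registry = new HintModel();
        // No parent link yet: the message/rfc822 hint is dropped.
        System.out.println(registry.applyHint("text/plain", "message/rfc822"));
        // With a parent link, the same hint is accepted.
        registry.addSuperType("message/rfc822", "text/plain");
        System.out.println(registry.applyHint("text/plain", "message/rfc822"));
    }
}
```

      In this model the first call prints text/plain and the second prints message/rfc822, which mirrors the report: while the specialization check returns false, the hints from CONTENT_TYPE and RESOURCE_NAME_KEY cannot override a text/plain magic match.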

      Attachments

        1. mbox_email_section.txt (2 kB, Matthew Caruana Galizia)
        2. mime_diffs_A_to_B.html (1 kB, Tim Allison)
        3. TIKA-879-thunderbird.eml (0.7 kB, Sebastian Nagel)


          People

            Assignee: Unassigned
            Reporter: Konstantin Gribov (grossws)
            Votes: 1
            Watchers: 9

