Tika / TIKA-2042

MBOX file detected wrongly as text/html

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 1.14
    • Component/s: None
    • Labels:
      None
    • Environment:

      Ubuntu 14.04, Apache Tika 1.13 and 1.14 nightly at the time of this writing

      Description

      The MBOX file isn't recognized by the "magic detection" mechanism as "application/mbox"; it is wrongly detected as "text/html".

      A workaround in Tika 1.13 is to place the following in custom-mimetypes.xml, as suggested on the mailing list (the priority has to be higher than that of message/rfc822):
      <mime-type type="application/mbox">
        <magic priority="70">
          <match value="From " type="string" offset="0"/>
        </magic>
        <glob pattern="*.mbox"/>
      </mime-type>
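
To illustrate why the priority matters, here is a minimal sketch of priority-based magic matching. It is not Tika's implementation: the `MagicPrioritySketch` and `Rule` names, the competing `message/rfc822` rule, and its priority of 50 are all assumptions made up for the demonstration. Only the `application/mbox` rule ("From " at offset 0, priority 70) comes from the workaround above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of priority-based magic matching; not Tika's real classes. */
final class MagicPrioritySketch {

    /** One magic rule: a byte pattern expected at a fixed offset, with a priority. */
    static final class Rule {
        final String mimeType;
        final int priority;
        final int offset;
        final byte[] pattern;

        Rule(String mimeType, int priority, int offset, String pattern) {
            this.mimeType = mimeType;
            this.priority = priority;
            this.offset = offset;
            this.pattern = pattern.getBytes(StandardCharsets.US_ASCII);
        }

        /** True if the pattern occurs in data exactly at this rule's offset. */
        boolean matches(byte[] data) {
            if (offset + pattern.length > data.length) {
                return false;
            }
            for (int i = 0; i < pattern.length; i++) {
                if (data[offset + i] != pattern[i]) {
                    return false;
                }
            }
            return true;
        }
    }

    /** Sample rule set; the second rule is invented for the demo. */
    static List<Rule> sampleRules() {
        List<Rule> rules = new ArrayList<>();
        // Mirrors the workaround: "From " at offset 0, priority 70.
        rules.add(new Rule("application/mbox", 70, 0, "From "));
        // Made-up competing rule with a lower priority (assumption).
        rules.add(new Rule("message/rfc822", 50, 0, "From"));
        return rules;
    }

    /** Returns the type of the highest-priority matching rule, or a generic fallback. */
    static String detect(byte[] data, List<Rule> rules) {
        Rule best = null;
        for (Rule rule : rules) {
            if (rule.matches(data) && (best == null || rule.priority > best.priority)) {
                best = rule;
            }
        }
        return best == null ? "application/octet-stream" : best.mimeType;
    }

    public static void main(String[] args) {
        byte[] mbox = "From alice@example.org Thu Jan  1 00:00:00 2015\nSubject: hi\n"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(mbox, sampleRules()));
    }
}
```

Both sample rules match the mbox separator line, so the result depends only on which rule has the higher priority. With the priorities swapped, the same input would come back as message/rfc822 in this sketch.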

      A sample MBOX file is attached.

      1. mbox_email_section.txt (2 kB, Matthew Caruana Galizia)
      2. mbox_header.txt (1 kB, Matthew Caruana Galizia)
      3. clojure.mbox (199 kB, Vjeran Marcinko)

          Activity

          hudson Hudson added a comment -

          FAILURE: Integrated in Jenkins build Tika-trunk #1337 (See https://builds.apache.org/job/Tika-trunk/1337/)
          TIKA-2042 – fix typo. (tallison: https://github.com/apache/tika/commit/00221ad34a18f501d967979288236b98bd4cec58)

          • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
          hudson Hudson added a comment -

          FAILURE: Integrated in Jenkins build Tika-trunk #1335 (See https://builds.apache.org/job/Tika-trunk/1335/)
          Two more EML header magics from Matthew Caruana Galizia from TIKA-2042 (nick: https://github.com/apache/tika/commit/0579efe59c35a4db248d8a5547af05fbee3caad4)

          • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
          gagravarr Nick Burch added a comment -

          Matthew Caruana Galizia I've added some more rfc822 magic, which I think should solve it for this most recent file. If I trim off everything up to the mbox From, then I can detect the outer file as mbox, and the embedded one as rfc822.

          Do please let us know if there's any more you come across!

          lfcnassif Luis Filipe Nassif added a comment -

          See TIKA-879. It looks like widening the magic search helped to detect more EMLs in the test corpus. Tim Allison, do you remember if that resulted in lots of false positives?

          lfcnassif Luis Filipe Nassif added a comment -

          This problem recurs very often. I think we should search for some of the EML magics in a larger range.

          mcaruanagalizia Matthew Caruana Galizia added a comment - edited

          I've attached a sample of one of the message sections from the MBOX. It is detected as text/html instead of message/rfc822.

          mcaruanagalizia Matthew Caruana Galizia added a comment -

          Nick Burch thank you - that fixes the detection of at least one of the MBOX files. Now the problem is that when the email streams get passed to the delegate parser by the ParsingEmbeddedDocumentExtractor implementation, they're detected as text/html instead of message/rfc822.

          hudson Hudson added a comment -

          FAILURE: Integrated in Jenkins build Tika-trunk #1331 (See https://builds.apache.org/job/Tika-trunk/1331/)
          TIKA-2042 Add a few more mbox patterns, based on file supplied by (nick: https://github.com/apache/tika/commit/0277fbb92c4714361949d59708db2a8734f1b1f2)

          • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
          gagravarr Nick Burch added a comment -

          Matthew Caruana Galizia I've added some more patterns in 0277fbb92c4714361949d59708db2a8734f1b1f2 based on the mbox you've supplied. Could you try that, and let us have redacted copies of any remaining mbox files that it doesn't fix?

          mcaruanagalizia Matthew Caruana Galizia added a comment -

          Header attached with identifying information stripped out. This file is detected as text/html instead of application/mbox.

          mcaruanagalizia Matthew Caruana Galizia added a comment - edited

          I'd like to ask for this issue to be reopened. Around half the MBOX files in our corpus are being detected as text/html. My guess is that there are two reasons for this:

          1) the files have no extension - the filenames are literally "mbox" rather than "*.mbox" (I think this is the way they're generated or used to be generated on Macs - they're in an *.mbox container directory, but the meat is within an mbox file contained within that directory);

          2) the headers don't fall within the 256-byte offset range specified by the matcher in the mimetypes XML file.
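
The second point can be illustrated with a small sketch of matching a pattern inside an offset window. This is not Tika's matcher: the `OffsetWindowSketch` class, the inclusive start-offset semantics, the "\nMessage-ID:" pattern, and the 1024-byte wider window are assumptions chosen for the example. Only the 256-byte limit comes from the comment above.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration of windowed magic matching; not Tika's real matcher. */
final class OffsetWindowSketch {

    /** True if pattern starts at some offset in [from, to] (inclusive, assumed semantics). */
    static boolean matchesInWindow(byte[] data, byte[] pattern, int from, int to) {
        // The pattern must fit entirely inside the data.
        int last = Math.min(to, data.length - pattern.length);
        for (int start = Math.max(from, 0); start <= last; start++) {
            if (regionEquals(data, start, pattern)) {
                return true;
            }
        }
        return false;
    }

    /** Compares pattern against data beginning at index start. */
    private static boolean regionEquals(byte[] data, int start, byte[] pattern) {
        for (int i = 0; i < pattern.length; i++) {
            if (data[start + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    /** Builds a sample whose header line begins after 299 bytes of padding. */
    static byte[] sampleWithLateHeader() {
        StringBuilder sb = new StringBuilder();
        while (sb.length() < 299) {
            sb.append(' ');
        }
        sb.append("\nMessage-ID: <1@example.org>\n");
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] data = sampleWithLateHeader();
        byte[] pattern = "\nMessage-ID:".getBytes(StandardCharsets.US_ASCII);
        // Narrow window: the header starts past offset 256, so no match is found.
        System.out.println(matchesInWindow(data, pattern, 0, 256));
        // Wider window: the same header is now within range.
        System.out.println(matchesInWindow(data, pattern, 0, 1024));
    }
}
```

Widening the window finds headers that appear later in the file, at the cost of scanning more bytes and possibly matching header-like text in files that are not email.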

          hudson Hudson added a comment -

          FAILURE: Integrated in tika-2.x-windows #28 (See https://builds.apache.org/job/tika-2.x-windows/28/)
          TIKA-2042 MBOX magic and detection unit test (nick: rev 65cc9bcecdc6b86294a88f3b2b6b26017f356ae5)

          • tika-app/src/test/java/org/apache/tika/mime/TestMimeTypes.java
          • tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
          hudson Hudson added a comment -

          FAILURE: Integrated in Tika-trunk #1085 (See https://builds.apache.org/job/Tika-trunk/1085/)
          TIKA-2042 MBOX magic and detection unit test (nick: rev 72d2d88b381ba75942ae791042ef54af33ee1f38)

          • tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
          • tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
          hudson Hudson added a comment -

          FAILURE: Integrated in tika-2.x #124 (See https://builds.apache.org/job/tika-2.x/124/)
          TIKA-2042 MBOX magic and detection unit test (nick: rev 65cc9bcecdc6b86294a88f3b2b6b26017f356ae5)

          • tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
          • tika-app/src/test/java/org/apache/tika/mime/TestMimeTypes.java
          gagravarr Nick Burch added a comment -

          Fixed in 72d2d88b381ba75942ae791042ef54af33ee1f38 - your test file is now detected as mbox even without the filename


            People

            • Assignee: Unassigned
            • Reporter: vjeran@tis.hr Vjeran Marcinko
            • Votes: 0
            • Watchers: 5
