TIKA-1602: Detecting standards-non-compliant emails as message/rfc822

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.10
    • Component/s: mime
    • Labels: None

      Description

      Tika does not properly detect certain emails as `message/rfc822` if they're slightly standards-non-compliant and begin with `Status: ` as the first header. I've added `Status: ` as a magic detection line in tika-mimetypes.xml.

      This solves my problem and does not appear to cause unit test failures. I have not yet run the tika-batch tests.

      As further information, the emails that are processed incorrectly come from dumps taken directly from various US public officials' mail servers. I believe that because the dumps are not intended to be transmitted over the wire, they are sometimes slightly non-compliant.

      It's important to note that Tika (and the underlying library, James Mime4J) do properly parse these emails, despite the non-compliant header. The problem is getting Tika to detect the file as an email so that Mime4J gets chosen to parse it.
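
The split between detection and parsing described above can be sketched as follows. This is a minimal illustration in Python, not Tika's implementation: `BASE_PREFIXES`, `PATCHED_PREFIXES`, `PARSERS`, `detect`, and `pick_parser` are hypothetical names, and the prefix list is an assumption chosen for the example. It only shows how adding `Status: ` as a magic prefix changes which parser gets selected.

```python
# Hypothetical sketch of magic-prefix detection followed by parser selection.
# None of these names come from Tika; they only illustrate the idea.

# Byte prefixes that, in this sketch, mark a file as an email (assumed examples).
BASE_PREFIXES = (b"Received: ", b"From: ", b"Message-ID: ")
# The same list plus the prefix this issue proposes to add.
PATCHED_PREFIXES = BASE_PREFIXES + (b"Status: ",)

# Map from detected type to the parser that would handle it.
PARSERS = {
    "message/rfc822": "mail parser (Mime4J-backed)",
    "text/plain": "plain-text parser",
}


def detect(data: bytes, prefixes=PATCHED_PREFIXES) -> str:
    """Return message/rfc822 if the data starts with a known prefix."""
    if data.startswith(prefixes):
        return "message/rfc822"
    return "text/plain"  # fallback used in this sketch


def pick_parser(data: bytes, prefixes=PATCHED_PREFIXES) -> str:
    """Choose a parser name based on the detected type."""
    return PARSERS[detect(data, prefixes)]


sample = b"Status: RO\r\nFrom: someone@example.com\r\nSubject: hello\r\n\r\nbody"
print(pick_parser(sample, BASE_PREFIXES))     # plain-text parser
print(pick_parser(sample, PATCHED_PREFIXES))  # mail parser (Mime4J-backed)
```

In this sketch the email body is never the problem: the file only reaches the mail parser once the detector recognizes its first bytes.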

      Pull request on GitHub at https://github.com/apache/tika/pull/40

      1. 036491.txt.zip
        25 kB
        Tim Allison

          Activity

          lfcnassif Luis Filipe Nassif added a comment -

          I think this is a duplicate of TIKA-879, where a more generic solution is discussed.

          jeremybmerrill Jeremy B. Merrill added a comment -

          Sounds about right, thanks for finding that for me. I'll go ahead and mark the issue a dupe or close it.

          Any idea when that patch'll get merged into trunk? (Or – since I'm an svn n00b – if there's a way for me to download that patched version.)

          chrismattmann Chris A. Mattmann added a comment -

          Hi Jeremy B. Merrill, thanks. Luis Filipe Nassif, thanks for the pointer. Jeremy proposed a patch on this one, and there is an open pull request (https://github.com/apache/tika/pull/40). I propose that we look at it: even though a generic solution is being discussed, there is nothing against getting existing solutions in sooner rather than later. When generic code is available to go with those generic solutions, we can push that too and keep improving.

          chrismattmann Chris A. Mattmann added a comment -

          Tim Allison can you test out: https://github.com/apache/tika/pull/40 and see if there are any regressions? If there aren't I'd like to commit this in the next 24-48 hours.

          tallison@mitre.org Tim Allison added a comment -

          Kicked off process now. Will run comparison in the morning.

          tallison@mitre.org Tim Allison added a comment - edited

          One file out of 116,960 text/plain files in govdocs1 was misidentified as rfc822. No other diffs found.

          YMMV.

          What's odd (to me) is that the RFC822 parser parsed lots and lots of empty embedded documents, and none of them had any text:

            {
              "Content-Type": "application/zip",
              "X-Parsed-By": [
                "org.apache.tika.parser.DefaultParser",
                "org.apache.tika.parser.pkg.PackageParser"
              ],
              "X-TIKA:content": "\n036491.txt\n\n",
              "X-TIKA:digest:MD5": "e7cf541cbd061b63c03035ec692b86c9",
              "X-TIKA:digest:SHA256": "96b29ca0c2206feafd6115d993c1fb20ead631381f048442c87870934fb2cd8e",
              "X-TIKA:parse_time_millis": "140"
            },
            {
              "Content-Encoding": "US-ASCII",
              "Content-Type": "text/plain; charset\u003dUS-ASCII",
              "X-Parsed-By": [
                "org.apache.tika.parser.DefaultParser",
                "org.apache.tika.parser.txt.TXTParser"
              ],
              "X-TIKA:digest:MD5": "4ef8164712f6491c2848e861336987b5",
              "X-TIKA:digest:SHA256": "9c8d0d8dc8633ab1a8324bcd19679616729360171fde33812b12c335938f45dc",
              "X-TIKA:embedded_resource_path": "embedded-1/036491.txt/embedded-2/embedded-3/embedded-4/embedded-5/embedded-6/embedded-7/embedded-8/embedded-9/embedded-10/embedded-11/embedded-12/embedded-13/embedded-14/embedded-15/embedded-16/embedded-17/embedded-18/embedded-19/embedded-20/embedded-21/embedded-22/embedded-23/embedded-24/embedded-25/embedded-26/embedded-27/embedded-28/embedded-29/embedded-30/embedded-31/embedded-32/embedded-33/embedded-34/embedded-35/embedded-36/embedded-37/embedded-38/embedded-39/embedded-40/embedded-41/embedded-42/embedded-43/embedded-44/embedded-45/embedded-46/embedded-47/embedded-48/embedded-49/embedded-50/embedded-51/embedded-52/embedded-53/embedded-54"
            },
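
To quantify how deeply the parser recursed, one could count the `embedded-N` segments in `X-TIKA:embedded_resource_path`. A minimal Python sketch, assuming the path is `/`-separated as in the output above; `embedded_depth` is a hypothetical helper, and the sample path is a shortened stand-in for the real one.

```python
import re

# Matches path segments of the form "embedded-<number>", as seen in the output above.
EMBEDDED_SEGMENT = re.compile(r"embedded-\d+")


def embedded_depth(resource_path: str) -> int:
    """Count how many embedded-N levels appear in an embedded resource path."""
    return sum(
        1 for segment in resource_path.split("/")
        if EMBEDDED_SEGMENT.fullmatch(segment)
    )


# Shortened stand-in for the path in the second record above.
path = "embedded-1/036491.txt/embedded-2/embedded-3/embedded-4"
print(embedded_depth(path))  # 4
```

Applied to the full path above, which ends at `embedded-54`, this should report 54 levels if the numbering is contiguous.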
          
          chrismattmann Chris A. Mattmann added a comment -

          Hey Tim Allison this doesn't sound like a significant regression. Are you +1 for me to commit this?

          (thanks BTW!)

          tallison@mitre.org Tim Allison added a comment -

          +1.

          This feels hacky, but we can undo it. Govdocs1 is limited, and our mileage will vary. Hopefully, someone will have the time to work on TIKA-879 soon.

          Jeremy B. Merrill, I'm sorry for taking so long to get around to running this simple test. Out of curiosity, what other headers were you getting in that batch of emails? I'm wondering if there are more specific rfc822'ish headers that we could rely on, or were you only getting "Status:"?

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/tika/pull/40

          chrismattmann Chris A. Mattmann added a comment -

          Patch applied! Thanks Jeremy B. Merrill!

          [chipotle:~/tmp/tika1.10] mattmann% svn commit -m "Fix for TIKA-1602: Detecting standards-non-compliant emails as message/rfc822 contributed by Jeremy B. Merrill <jeremy.merrill@nytimes.com> this closes #40."
          Sending        CHANGES.txt
          Sending        tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
          Transmitting file data ..
          Committed revision 1688647.
          [chipotle:~/tmp/tika1.10] mattmann% 
          
          hudson Hudson added a comment -

          SUCCESS: Integrated in tika-trunk-jdk1.7 #777 (See https://builds.apache.org/job/tika-trunk-jdk1.7/777/)
          Fix for TIKA-1602: Detecting standards-non-compliant emails as message/rfc822 contributed by Jeremy B. Merrill <jeremy.merrill@nytimes.com> this closes #40. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1688647)

          • /tika/trunk/CHANGES.txt
          • /tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
          jeremybmerrill Jeremy B. Merrill added a comment -

          Thank you, Chris A. Mattmann, Tim Allison et al.!

          Tim Allison – got a bunch of normal headers, but also this `Status:` one. The only value I have seen so far in my dataset (a bunch of publicly released emails from Jeb Bush's tenure as FL Gov) is `RO`, so the first line of the emails that Tika treated improperly before this patch was uniformly `Status: RO`.

          I'm going to check the whole dataset once I manage to download it all back down again from storage to make sure there are no other values than `RO`.

          My understanding is that some mail servers use this header internally to keep track of read status. When emails are exported, they retain the header, and it sometimes appears first – even though the server would never send this header over the wire.

          hudson Hudson added a comment -

          SUCCESS: Integrated in tika-trunk-jdk1.7 #779 (See https://builds.apache.org/job/tika-trunk-jdk1.7/779/)
          Remove change comment, TIKA-1602 (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1688805)

          • /tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
          jeremybmerrill Jeremy B. Merrill added a comment -

          Looks like the possible values are:
          ```
          Status: O
          Status:
          Status: U
          Status: O
          Status: R
          Status: RO
          Status: U
          Status: U
          ```
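
The list above contains repeats. A short Python sketch that reduces it to distinct values; the `lines` list is copied from the comment above, and `status_values` is a hypothetical helper.

```python
def status_values(lines):
    """Return the distinct Status header values, in first-seen order."""
    seen = []
    for line in lines:
        name, _, value = line.partition(":")
        if name.strip() != "Status":
            continue  # ignore anything that is not a Status header
        value = value.strip()
        if value not in seen:
            seen.append(value)
    return seen


lines = [
    "Status: O", "Status:", "Status: U", "Status: O",
    "Status: R", "Status: RO", "Status: U", "Status: U",
]
print(status_values(lines))  # ['O', '', 'U', 'R', 'RO']
```

The empty value matters for prefix matching: a magic string of `Status: ` with a trailing space would not match a line that is exactly `Status:`. Whether the original lines carried trailing whitespace is not visible here.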

          chrismattmann Chris A. Mattmann added a comment -

          Got it, Jeremy B. Merrill - can you open a new pull request and JIRA issue and send 'em along?


            People

            • Assignee: chrismattmann Chris A. Mattmann
            • Reporter: jeremybmerrill Jeremy B. Merrill
            • Votes: 0
            • Watchers: 7

            Time Tracking

            • Original Estimate: 1h
            • Remaining Estimate: 1h
            • Time Spent: Not Specified