TIKA-2478: RFC822 includes redundant copies of the text

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.16
    • Fix Version/s: 1.17
    • Component/s: None
    • Labels: None

      Description

      MBOX messages often get parsed into four documents:
      a. The mbox file, which is the outer container: "/"
      b. The actual email: "/embedded-1"
      c. The UTF-8 text content of the email: "/embedded-1/embedded-2"
      d. The UTF-8 HTML content of the email: "/embedded-1/embedded-3"

      Entries c and d are redundant and distracting. The MSG parser parses the first non-null email body and then skips the rest. Please modify the MBOX parser so that it does not create separate "attached" documents for the HTML body and the text body.
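
      To make the requested behavior concrete, here is a minimal, self-contained sketch of the "first non-null body" selection described above. It does not use Tika's API or reflect the actual patch: BodySelection, BodyPart, and firstNonNullBody are hypothetical names invented for illustration, and the candidate order is an assumption.

```java
import java.util.Arrays;
import java.util.List;

class BodySelection {

    // Hypothetical stand-in for one alternative body of an email (not a Tika class).
    static final class BodyPart {
        final String mimeType;
        final String content; // null when the message does not carry this body

        BodyPart(String mimeType, String content) {
            this.mimeType = mimeType;
            this.content = content;
        }
    }

    // Returns the first candidate whose content is non-null, or null if none has content.
    // The remaining candidates are skipped rather than emitted as separate documents.
    static BodyPart firstNonNullBody(List<BodyPart> candidates) {
        for (BodyPart part : candidates) {
            if (part.content != null) {
                return part;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Assumed candidate order; a real parser may prefer a different order.
        List<BodyPart> bodies = Arrays.asList(
                new BodyPart("text/html", null),
                new BodyPart("text/plain", "Hello"));

        BodyPart chosen = firstNonNullBody(bodies);
        System.out.println(chosen == null ? "no body" : chosen.mimeType);
    }
}
```

      With this approach the chosen body would become the content of "/embedded-1" itself, so no "/embedded-1/embedded-2" or "/embedded-1/embedded-3" documents would be produced for the bodies.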

      The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an example of input sufficient to generate this behavior.

      Thanks!

        Attachments

        1. TIKA-2478.patch
          80 kB
          Tim Allison
        2. mixed-with-pdf-inline
          40 kB
          Ken Krugler
        3. mixed-simple
          2 kB
          Ken Krugler
        4. UET6KCXR5FYIEJYKUCK2AKF3FLXTRNAT.eml
          8 kB
          Tim Allison

              People

              • Assignee: Tim Allison (tallison@mitre.org)
              • Reporter: Robert Letzler (letzlerr)
              • Votes: 0
              • Watchers: 6
