Tika / TIKA-2478

RFC822 includes redundant copies of the text

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.16
    • Fix Version/s: 1.17
    • Component/s: None
    • Labels: None

    Description

      MBOX messages often get parsed into four documents:
      a. The mbox file, which is the outer container: "/"
      b. The actual email: "/embedded-1"
      c. The UTF-8 text content of the email: "/embedded-1/embedded-2"
      d. The UTF-8 HTML content of the email: "/embedded-1/embedded-3"

      Entries c and d are redundant and distracting. The MSG parser parses the first non-null email body and then skips the rest. Please modify the MBOX parser so that it does not emit separate "attached" documents for the HTML body and the text body.
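
The "first non-null body" behavior requested above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Tika's actual code: the class `FirstBodySelector`, the method `firstNonNullBody`, and the ordering of the candidate bodies are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.Optional;

public class FirstBodySelector {

    /**
     * Returns the first non-null body from a list of candidates given in
     * priority order, or an empty Optional if every candidate is null.
     * (Hypothetical helper; not part of the Tika API.)
     */
    static Optional<String> firstNonNullBody(List<String> candidates) {
        return candidates.stream()
                .filter(Objects::nonNull)
                .findFirst();
    }

    public static void main(String[] args) {
        // Hypothetical candidates: a missing body, then an HTML body, then a text body.
        List<String> bodies = Arrays.asList(null, "<p>Hello</p>", "Hello");

        // Only the first available body is kept; the remaining ones are skipped
        // rather than emitted as separate embedded documents.
        System.out.println(firstNonNullBody(bodies).orElse(""));
    }
}
```

With this approach the email yields one body in the parent document ("/embedded-1") instead of additional "/embedded-1/embedded-2" and "/embedded-1/embedded-3" entries.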

      The file attached to https://issues.apache.org/jira/browse/TIKA-2471 is sufficient to reproduce this behavior.

      Thanks!

      Attachments

        1. UET6KCXR5FYIEJYKUCK2AKF3FLXTRNAT.eml (8 kB, Tim Allison)
        2. mixed-simple (2 kB, Kenneth William Krugler)
        3. mixed-with-pdf-inline (40 kB, Kenneth William Krugler)
        4. TIKA-2478.patch (80 kB, Tim Allison)

            People

              tallison Tim Allison
              letzlerr Robert Letzler
              Votes: 0
              Watchers: 6