Tika / TIKA-3599

Command line tika extracts encoding of file in eml


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.1.0
    • Fix Version/s: None
    • Component/s: app
    • Labels: None
    • Environment: Windows 10 Pro, version 10.0.19043, build 19043

      Java:

      openjdk version "1.8.0-262"
      OpenJDK Runtime Environment (build 1.8.0-262-b10)
      OpenJDK 64-Bit Server VM (build 25.71-b10, mixed mode)

      OCR:

      Tesseract 5

    Description

      Tika cannot extract the text from the attached .eml file. Instead, it returns what I think is the raw encoded content of the attachments.

      This does not happen with all .eml files, and we have not been able to identify the cause. The same message saved in .msg format is extracted correctly.

      The extracted .txt file is roughly the same size as the original .eml file (12.94 MB versus 12.16 MB).
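
One way to test the suspicion that the output is mostly encoded attachment data is to measure how many of its lines look like Base64. The sketch below is only a rough heuristic, not part of Tika: the function name `base64_line_fraction` and the 60-character threshold are assumptions chosen for illustration, and it also assumes the attachments were Base64-encoded, which the report does not confirm.

```python
import re

# A line counts as "Base64-like" if it consists only of Base64 alphabet
# characters (plus optional trailing padding) and is at least 60 characters
# long. The threshold is an arbitrary assumption for this sketch.
BASE64_LINE = re.compile(r"[A-Za-z0-9+/]{60,}={0,2}")


def base64_line_fraction(text: str) -> float:
    """Return the fraction of non-empty lines that look like Base64 data."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    hits = sum(1 for line in lines if BASE64_LINE.fullmatch(line))
    return hits / len(lines)
```

Reading output.txt into a string and passing it to this function gives a value between 0 and 1. A value close to 1 would be consistent with Tika having passed the encoded message body through instead of parsing it.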

      I will attach the .eml file and the output produced by Tika.

      The command used is:

      java -jar tika-app-2.1.0.jar path\to\eml_test.eml > output.txt
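
To help narrow down why this particular message behaves differently, it may be useful to inspect its MIME structure independently of Tika. The following is a minimal sketch using Python's standard `email` package. The helper name `describe_parts` is hypothetical, and this is not how Tika parses mail; it only lists what a standards-based parser sees in the file.

```python
from email import message_from_bytes, policy


def describe_parts(raw: bytes):
    """Return (content type, transfer encoding, filename) for every MIME part.

    The first row describes the top-level message itself; the following rows
    describe its nested parts in document order.
    """
    message = message_from_bytes(raw, policy=policy.default)
    rows = []
    for part in message.walk():
        rows.append((
            part.get_content_type(),
            str(part.get("Content-Transfer-Encoding", "(none)")),
            part.get_filename() or "(no filename)",
        ))
    return rows
```

Calling `describe_parts` on the bytes of eml_test.eml and printing each row shows whether the file is recognised as a multipart message at all. If only a single text/plain row comes back, the headers at the top of the file may be malformed, which would be worth mentioning in this report.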

      Attachments

        1. output.txt (12.94 MB, uploaded by GIOELE PERIN)
        2. fix_1.png (23 kB, uploaded by Tim Allison)
        3. eml_test.eml (12.16 MB, uploaded by GIOELE PERIN)
        4. as_is.png (22 kB, uploaded by Tim Allison)


          People

            Assignee: Unassigned
            Reporter: GIOELE PERIN (GPerin)
