Tika / TIKA-3545

TIKA PDF parsing issues


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Bug
    • Affects Version/s: 1.21
    • Fix Version/s: None
    • Component/s: parser, tika-server
    • Labels: None
    • Environment: Tested on DEV env

    Description

      I am using the tika-core 1.21 and tika-parsers 1.21 jar files as Tika dependencies in ManifoldCF 2.14 to crawl some files. Some of the PDF files are not being parsed correctly: strange characters appear in the extracted text.

      I also tried switching the Tika jar files to versions 1.24 and 1.27, but that did not help (with 1.27, the files were not even extracted correctly).

      I checked the content of the documents themselves, and it seems to be fine.
      Can anybody help me with this?

      An image showing the strange characters is attached for reference.
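
      When triaging a report like this, it can help to quantify the "strange characters" in the extracted text before comparing Tika versions. The following is a minimal sketch using only the Java standard library, not part of Tika or ManifoldCF. The class name `SuspiciousCharReport` and the choice of what counts as "suspicious" (the replacement character U+FFFD, control characters, private-use and unassigned code points) are assumptions made for illustration. It does not diagnose the cause, it only reports what is present in the text.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: counts code points in extracted text that often
// indicate a font-mapping or encoding problem. The heuristic is an assumption.
public class SuspiciousCharReport {

    // Returns a map from suspicious code point to the number of occurrences.
    public static Map<Integer, Integer> countSuspicious(String text) {
        Map<Integer, Integer> counts = new TreeMap<>();
        text.codePoints().forEach(cp -> {
            if (isSuspicious(cp)) {
                counts.merge(cp, 1, Integer::sum);
            }
        });
        return counts;
    }

    // Ordinary line breaks and tabs are treated as normal text.
    public static boolean isSuspicious(int cp) {
        if (cp == '\n' || cp == '\r' || cp == '\t') {
            return false;
        }
        if (cp == 0xFFFD) { // Unicode replacement character
            return true;
        }
        int type = Character.getType(cp);
        return type == Character.CONTROL
                || type == Character.PRIVATE_USE
                || type == Character.UNASSIGNED;
    }

    public static void main(String[] args) throws Exception {
        // Pass the path of a UTF-8 text file holding the extracted content;
        // without an argument a small built-in sample is used instead.
        String text = args.length > 0
                ? new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8)
                : "Sample \uFFFD\uFFFD text \uE000";
        countSuspicious(text).forEach((cp, n) ->
                System.out.printf("U+%04X x %d%n", cp, n));
    }
}
```

      Running it on the text extracted by each Tika version gives a rough, comparable count per version. A large number of U+FFFD or private-use code points would be consistent with a character-mapping problem in the PDF's fonts, though that is only one possible explanation.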

      Attachments

        1. 365.jpg
          81 kB
          Priya
        2. Handleiding Digitale koffiecorner.pdf
          295 kB
          Priya
        3. KSF1.pdf
          64 kB
          Priya
        4. KSF1.txt
          1 kB
          Tim Allison

          People

            Assignee: Unassigned
            Reporter: Priya (PriSmart)
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated:
              Resolved:
