Tika / TIKA-722

Arabic PDF doesn't extract correctly

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

    Description

      I have a PDF with an Arabic font that Tika fails to extract: the output
      is all gibberish.

      It looks like the PDF does not include the separate Unicode text
      metadata (hmm: would Tika extract that if it were present?), and
      copying and pasting out of the PDF also produces gibberish.

      To fix this, I think we'd somehow have to know the mapping from this
      font's character codes to Unicode (this particular font is AXTManal).
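
      As a rough illustration of what "knowing the mapping" could look like,
      here is a minimal Java sketch that post-processes extracted text
      through a per-font lookup table. The class name, the method, and
      above all the table entries are hypothetical placeholders: the real
      AXTManal layout is not known here, and nothing like this exists in
      Tika as far as this issue states.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: remaps the codes a legacy font emits to real Unicode code points.
final class LegacyFontRemap {

    // HYPOTHETICAL table for illustration. These entries are NOT the actual
    // AXTManal layout; a real table would have to be derived from the font.
    static final Map<Integer, Integer> HYPOTHETICAL_TABLE = new HashMap<>();
    static {
        HYPOTHETICAL_TABLE.put((int) 'A', 0x0627); // ARABIC LETTER ALEF
        HYPOTHETICAL_TABLE.put((int) 'B', 0x0628); // ARABIC LETTER BEH
        HYPOTHETICAL_TABLE.put((int) 'T', 0x062A); // ARABIC LETTER TEH
    }

    // Replaces every code point that has an entry in the table and
    // leaves all other code points (spaces, digits, ...) unchanged.
    static String remap(String extracted, Map<Integer, Integer> table) {
        StringBuilder out = new StringBuilder(extracted.length());
        extracted.codePoints()
                 .forEach(cp -> out.appendCodePoint(table.getOrDefault(cp, cp)));
        return out.toString();
    }

    public static void main(String[] args) {
        // "AB T" stands in for the gibberish an extractor might return.
        System.out.println(remap("AB T", HYPOTHETICAL_TABLE));
    }
}
```

      The table is passed in as a parameter because such a mapping would be
      specific to each font, so every affected font would need its own table.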

      Attachments

        1. 000279.pdf (89 kB, Michael McCandless)
        2. JUFO96.PDF (190 kB, Uwe Schindler)
        3. metadata.png (49 kB, Uwe Schindler)

            People

              Assignee: Unassigned
              Reporter: Michael McCandless (mikemccand)
              Votes: 0
              Watchers: 0
