  1. PDFBox
  2. PDFBOX-1127

PDF supplies glyph->unicode mapping, but PDFBox doesn't use it.


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.7.0
    • Fix Version/s: 1.7.0
    • Component/s: None
    • Labels:
      None
    • Environment:
      Tested trunk r1177011

      Description

      A user reported this PDF on the Lucene lists: http://www.lucidimagination.com/search/document/7a8c14a534d9a84c/tika_can_not_parse_all_of_the_persian_pdf_files

      I asked them to create a TIKA issue (TIKA-713) and attach the PDF file

      Upon inspection, the fonts used in the PDF have custom encodings that map the characters to U+0001, U+0002, and so on. However, each font also contains a mapping back to Unicode in its encoding dictionary (/Type /Encoding /BaseEncoding /WinAnsiEncoding /Differences), and PDFBox doesn't use this mapping. If you run ExtractText, it extracts the raw control characters instead.
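
      To illustrate the kind of lookup the description refers to, here is a minimal, self-contained sketch of how a /Differences array can be turned into a code-to-Unicode table. It is not PDFBox code and not the fix that was applied. The class and method names are hypothetical, and the glyph names ("uni0633" and so on) are assumed for illustration, since the report does not list the names actually used in ebrat.pdf. The sketch only resolves glyph names of the form "uniXXXX"; a real implementation would also need a glyph list lookup and the base encoding.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from PDFBox.
public class DifferencesSketch {

    /**
     * Builds a code -> glyph name table from a /Differences-style array.
     * An integer sets the current code; each following name is assigned
     * to the current code, which then increases by one.
     */
    public static Map<Integer, String> applyDifferences(List<Object> differences) {
        Map<Integer, String> codeToName = new HashMap<>();
        int code = -1;
        for (Object item : differences) {
            if (item instanceof Integer) {
                code = (Integer) item;
            } else if (code >= 0) {
                codeToName.put(code, (String) item);
                code++;
            }
        }
        return codeToName;
    }

    /** Resolves glyph names of the form "uniXXXX"; returns null for anything else. */
    public static String glyphNameToUnicode(String name) {
        if (name == null || name.length() != 7 || !name.startsWith("uni")) {
            return null;
        }
        try {
            int codePoint = Integer.parseInt(name.substring(3), 16);
            return new String(Character.toChars(codePoint));
        } catch (IllegalArgumentException e) {
            // Not four valid hex digits, or not a valid code point.
            return null;
        }
    }

    /** Decodes single-byte character codes, falling back to the raw code when unmapped. */
    public static String decode(byte[] codes, Map<Integer, String> codeToName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xFF;
            String unicode = glyphNameToUnicode(codeToName.get(code));
            sb.append(unicode != null ? unicode : String.valueOf((char) code));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed example array: codes 1..4 mapped to four Arabic-script letters.
        List<Object> differences =
                Arrays.<Object>asList(1, "uni0633", "uni0644", "uni0627", "uni0645");
        Map<Integer, String> table = applyDifferences(differences);

        String text = decode(new byte[] {1, 2, 3, 4}, table);
        // Print code points rather than the text itself to avoid console encoding issues.
        text.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

      Without the table, the bytes 1 to 4 would come out as U+0001 to U+0004, which matches the raw control characters described above.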

        Attachments

        1. ebrat.pdf
          73 kB
          Robert Muir
        2. encoding.jpg
          50 kB
          Robert Muir
        3. PDFBOX1127-ebrat.txt
          12 kB
          Andreas Lehmkühler


            People

            • Assignee:
              lehmi Andreas Lehmkühler
              Reporter:
              rcmuir Robert Muir
