  1. PDFBox
  2. PDFBOX-1127

PDF supplies glyph->unicode mapping, but PDFBox doesn't use it.

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.7.0
    • Fix Version/s: 1.7.0
    • Component/s: None
    • Labels: None
    • Environment: Tested trunk r1177011

    Description

      A user reported this PDF to the Lucene lists: http://www.lucidimagination.com/search/document/7a8c14a534d9a84c/tika_can_not_parse_all_of_the_persian_pdf_files

      I asked them to create a TIKA issue (TIKA-713) and attach the PDF file.

      Upon inspection, the fonts used in the PDF have custom encodings that map the characters to U+0001, U+0002, and so on. However, each font also carries a mapping back to Unicode in its encoding dictionary (>>/Type/Encoding/BaseEncoding/WinAnsiEncoding/Differences), and PDFBox doesn't use this mapping. If you run ExtractText on the file, it extracts the raw control characters instead of the intended text.
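
To make the expected behaviour concrete, here is a minimal sketch of how a /Differences array can be turned into Unicode text. It is not PDFBox code and not the fix that was applied. The class name `DifferencesDemo`, its method names, and the sample glyph names are all hypothetical, since the actual glyph names in ebrat.pdf are not quoted in this report. The sketch assumes only two things: that a /Differences array is a sequence in which a number sets the next character code and each following name is assigned to consecutive codes, and that glyph names of the form `uniXXXX` denote the code point U+XXXX.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DifferencesDemo {

    // Walk a /Differences array: an Integer sets the next character code,
    // and every glyph name after it is assigned to consecutive codes.
    static Map<Integer, String> codeToGlyphName(List<Object> differences) {
        Map<Integer, String> result = new LinkedHashMap<>();
        int code = 0;
        for (Object entry : differences) {
            if (entry instanceof Integer) {
                code = (Integer) entry;
            } else {
                result.put(code++, (String) entry);
            }
        }
        return result;
    }

    // Resolve a glyph name to Unicode. The lookup table is a tiny stand-in
    // for a full glyph list (assumption); "uniXXXX" names are decoded directly.
    static String glyphNameToUnicode(String name, Map<String, String> glyphList) {
        if (glyphList.containsKey(name)) {
            return glyphList.get(name);
        }
        if (name.matches("uni[0-9A-F]{4}")) {
            return String.valueOf((char) Integer.parseInt(name.substring(3), 16));
        }
        return null; // unknown glyph name
    }

    // Decode raw character codes. Codes without a usable mapping fall back to
    // the raw code value, which is the behaviour this issue complains about.
    static String decode(int[] codes, List<Object> differences,
                         Map<String, String> glyphList) {
        Map<Integer, String> names = codeToGlyphName(differences);
        StringBuilder text = new StringBuilder();
        for (int code : codes) {
            String name = names.get(code);
            String unicode = (name == null) ? null : glyphNameToUnicode(name, glyphList);
            text.append(unicode != null ? unicode : String.valueOf((char) code));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Hypothetical encoding: codes 1..4 carry Persian letters.
        List<Object> differences =
                Arrays.asList(1, "uni0633", "uni0644", "uni0627", "uni0645");
        Map<String, String> glyphList = new HashMap<>();
        glyphList.put("space", " ");

        int[] rawCodes = {1, 2, 3, 4};
        System.out.println(decode(rawCodes, differences, glyphList));
    }
}
```

Without the /Differences lookup, the same four codes would come out as the control characters U+0001 through U+0004, which matches what ExtractText produced for the attached file.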

      Attachments

        1. ebrat.pdf
          73 kB
          Robert Muir
        2. encoding.jpg
          50 kB
          Robert Muir
        3. PDFBOX1127-ebrat.txt
          12 kB
          Andreas Lehmkühler

            People

              lehmi Andreas Lehmkühler
              rcmuir Robert Muir