[PDFBOX-1127] PDF supplies glyph->unicode mapping, but PDFBox doesn't use it. - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.7.0
Fix Version/s: 1.7.0
Component/s: None
Labels: None
Environment: Tested trunk r1177011

Description

A user reported this PDF to the Lucene mailing lists: http://www.lucidimagination.com/search/document/7a8c14a534d9a84c/tika_can_not_parse_all_of_the_persian_pdf_files

I asked them to create a TIKA issue (TIKA-713) and attach the PDF file.

Upon inspection, the fonts used in the PDF have custom encodings that map the characters to U+0001, U+0002, and so on. However, each font also contains an encoding dictionary (/Type /Encoding, /BaseEncoding /WinAnsiEncoding, plus a /Differences array) that provides a mapping from the font's codes to Unicode, but PDFBox doesn't use this mapping. If you run ExtractText on the file, it extracts the raw control characters instead.
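
To illustrate the kind of lookup described above, here is a minimal, self-contained Java sketch of how a /Differences array can be expanded and used to recover text. It does not use the PDFBox API and is not the fix that was committed. The class and method names are hypothetical, the glyph names in the example are assumed (the names actually used in ebrat.pdf are not stated here), and only glyph names of the form `uniXXXX` are resolved; a real implementation would also consult a full glyph list.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not PDFBox code, names are illustrative only.
final class DifferencesSketch {

    // Expands a /Differences array into a code -> glyph name map.
    // An Integer sets the current code; each following String (glyph name)
    // is assigned to consecutive codes starting from it.
    static Map<Integer, String> expandDifferences(List<Object> differences) {
        Map<Integer, String> codeToName = new HashMap<>();
        int code = -1;
        for (Object item : differences) {
            if (item instanceof Integer) {
                code = (Integer) item;
            } else if (item instanceof String && code >= 0) {
                codeToName.put(code++, (String) item);
            }
        }
        return codeToName;
    }

    // Resolves glyph names of the form "uniXXXX" (four uppercase hex digits).
    // Returns null for any other name; a glyph list lookup would go here.
    static String glyphNameToUnicode(String name) {
        if (name.matches("uni[0-9A-F]{4}")) {
            return String.valueOf((char) Integer.parseInt(name.substring(3), 16));
        }
        return null;
    }

    // Decodes single-byte character codes. Codes without a usable mapping
    // fall back to the raw code, which is how control characters end up
    // in the extracted text.
    static String decode(byte[] codes, Map<Integer, String> codeToName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xFF;
            String name = codeToName.get(code);
            String text = (name == null) ? null : glyphNameToUnicode(name);
            sb.append(text != null ? text : String.valueOf((char) code));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed example array: [ 1 /uni0633 /uni0644 /uni0627 /uni0645 ]
        List<Object> differences =
                Arrays.asList(1, "uni0633", "uni0644", "uni0627", "uni0645");
        Map<Integer, String> codeToName = expandDifferences(differences);
        System.out.println(decode(new byte[] {1, 2, 3, 4}, codeToName));
    }
}
```

With the mapping applied, codes 1 through 4 decode to the Arabic-script letters named in the array; without it, the same bytes come out as U+0001 through U+0004, which matches the symptom reported here.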

Attachments

ebrat.pdf | 02/Oct/11 20:09 | 73 kB | Robert Muir
encoding.jpg | 02/Oct/11 20:08 | 50 kB | Robert Muir
PDFBOX1127-ebrat.txt | 03/Oct/11 15:23 | 12 kB | Andreas Lehmkühler

Issue Links

is depended upon by

TIKA-713 Tika can not parse all of the persian pdf files

Resolved

People

Assignee: Andreas Lehmkühler

Reporter: Robert Muir

Votes: 0

Watchers: 1

Dates

Created: 02/Oct/11 20:08

Updated: 02/May/13 02:29

Resolved: 03/Oct/11 15:23