Details
- Bug
- Status: Closed
- Major
- Resolution: Fixed
- 1.7.0
- None
- None
- Tested trunk r1177011
Description
A user reported this PDF on the Lucene lists: http://www.lucidimagination.com/search/document/7a8c14a534d9a84c/tika_can_not_parse_all_of_the_persian_pdf_files
I asked them to create a TIKA issue (TIKA-713) and attach the PDF file.
Upon inspection, the fonts used in the PDF have custom encodings that map the characters to U+0001, U+0002, and so on. However, the fonts also contain a mapping to Unicode (>>/Type/Encoding/BaseEncoding/WinAnsiEncoding/Differences), and PDFBox doesn't use this mapping. If you use ExtractText, it extracts the raw control characters instead.
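To illustrate the kind of lookup that is being skipped, here is a minimal sketch in plain Java of how a /Differences array can be turned into a code-to-Unicode mapping. It is not PDFBox's implementation or the eventual fix. The class name `DifferencesSketch`, its methods, and the sample array are all hypothetical; the array values are made up for illustration and are not taken from the attached PDF. The sketch only resolves glyph names of the form `uniXXXX`; a real implementation would also need a full glyph list.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of PDFBox.
final class DifferencesSketch {

    // Builds a code -> glyph name map from the tokens of a /Differences array.
    // An Integer token sets the next character code; each glyph name (String)
    // that follows is assigned the current code, which then advances by one.
    static Map<Integer, String> parseDifferences(List<Object> tokens) {
        Map<Integer, String> codeToName = new HashMap<>();
        int code = 0;
        for (Object token : tokens) {
            if (token instanceof Integer) {
                code = (Integer) token;
            } else {
                codeToName.put(code++, (String) token);
            }
        }
        return codeToName;
    }

    // Assumption: only "uniXXXX" glyph names are handled here.
    // Returns null for anything else, including a missing name.
    static String glyphNameToUnicode(String name) {
        if (name != null && name.matches("uni[0-9A-F]{4}")) {
            return String.valueOf((char) Integer.parseInt(name.substring(3), 16));
        }
        return null;
    }

    // Decodes character codes through the map. Codes that cannot be resolved
    // fall back to the raw code, which is the behaviour described in this issue.
    static String decode(int[] codes, Map<Integer, String> codeToName) {
        StringBuilder out = new StringBuilder();
        for (int c : codes) {
            String unicode = glyphNameToUnicode(codeToName.get(c));
            out.append(unicode != null ? unicode : String.valueOf((char) c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up array: codes 1..4 map to four Arabic-script letters.
        List<Object> differences =
                Arrays.<Object>asList(1, "uni0633", "uni0644", "uni0627", "uni0645");
        Map<Integer, String> map = parseDifferences(differences);
        String text = decode(new int[] {1, 2, 3, 4}, map);
        for (char ch : text.toCharArray()) {
            System.out.printf("U+%04X ", (int) ch);
        }
        System.out.println();
    }
}
```

Without the lookup, the same four codes would come out as U+0001 through U+0004, which matches the control characters seen in the ExtractText output.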
Attachments
Issue Links
- is depended upon by
- TIKA-713 Tika can not parse all of the persian pdf files
- Resolved