- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 1.7.0
- Fix Version/s: 1.7.0
- Component/s: None
- Labels: None
- Environment: Tested trunk r1177011
We had a user report this PDF to the Lucene lists: http://www.lucidimagination.com/search/document/7a8c14a534d9a84c/tika_can_not_parse_all_of_the_persian_pdf_files
I asked them to create a TIKA issue (TIKA-713) and attach the PDF file.
Upon inspection, the fonts used in the PDF have custom encodings that map the characters to U+0001, U+0002, and so on. However, the fonts also contain a mapping to Unicode in their encoding dictionary (/Type/Encoding/BaseEncoding/WinAnsiEncoding/Differences), and PDFBox doesn't use this mapping. As a result, running ExtractText on the file outputs the raw control characters instead of the intended text.
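To illustrate the behavior described above, here is a minimal sketch of how a /Differences array could be applied when decoding character codes. It is not PDFBox code. The class name `DifferencesSketch`, its methods, and the sample glyph names and codes are all hypothetical, and the real WinAnsiEncoding base table is replaced by an empty stand-in. It also assumes glyph names of the form `uniXXXX`, which may not match the names used in the attached PDF.

```java
// Illustrative sketch only: not PDFBox code. All names, codes, and glyph
// names here are assumptions made for the example.
class DifferencesSketch {

    /**
     * Applies a /Differences array to a 256-entry code-to-glyph-name table.
     * A number sets the current code; each following name is assigned to
     * consecutive codes starting from that number.
     */
    static String[] applyDifferences(String[] base, Object[] differences) {
        String[] table = base.clone();
        int code = 0;
        for (Object item : differences) {
            if (item instanceof Integer) {
                code = (Integer) item;
            } else {
                table[code++] = (String) item;
            }
        }
        return table;
    }

    /** Resolves a "uniXXXX" glyph name to a code point; returns -1 otherwise. */
    static int glyphNameToUnicode(String name) {
        if (name != null && name.length() == 7 && name.startsWith("uni")) {
            try {
                return Integer.parseInt(name.substring(3), 16);
            } catch (NumberFormatException e) {
                return -1;
            }
        }
        return -1;
    }

    /** Decodes single-byte codes, falling back to the raw code if unmapped. */
    static String decode(byte[] codes, String[] table) {
        StringBuilder sb = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xFF;
            int codePoint = glyphNameToUnicode(table[code]);
            sb.appendCodePoint(codePoint >= 0 ? codePoint : code);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the base encoding: no glyph names assigned.
        String[] base = new String[256];
        // Hypothetical /Differences content: codes 1..4 mapped to Arabic-script glyphs.
        Object[] differences = {1, "uni0633", "uni0644", "uni0627", "uni0645"};
        String[] table = applyDifferences(base, differences);

        byte[] raw = {1, 2, 3, 4};
        // Ignoring the mapping yields control characters U+0001..U+0004;
        // honoring it yields the mapped letters.
        System.out.println(decode(raw, table));
    }
}
```

Decoding the same bytes against the unmodified `base` table reproduces the symptom reported here: the output is the raw control characters U+0001 through U+0004.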
- is depended upon by: TIKA-713 Tika can not parse all of the persian pdf files (Resolved)