Details
- Bug
- Status: Resolved
- Minor
- Resolution: Won't Fix
Description
I have a PDF with an Arabic font that Tika fails to extract (the output
is all gibberish).
Looks like the PDF does not include the separate Unicode text metadata
(hmm: would Tika extract that if it were present?), and copy/paste out
of the PDF also produces gibberish.
To fix this I think we'd somehow have to know the mapping for the
font (this particular font is AXTManal)?
Attachments
Issue Links
- is related to TIKA-1337 LanguageProfile for Persian/Farsi (Resolved)
I don't think there is much we can do. Some PDF files, especially those created by e.g. LaTeX via dvips -> pdf (pdflatex mostly works fine), use internal, dynamically compressed fonts that have their glyphs at totally different code positions. This often happens when the PDF creator uses antique software or fonts that only know 256 code points (from pre-Unicode times). In that case, the font file only contains the glyphs actually present in the text, packed into the available code points.
Those PDFs are unparseable, and full-text extraction does not even work with Acrobat Reader. But they are still valid PDF files, as they are intended to be printed out. This is like a PDF file containing only a background TIFF image instead of text: the text cannot be extracted.
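To illustrate the glyph remapping described above, here is a minimal sketch in plain Java. It is not Tika or PDFBox code, and the class `SubsetFontDemo` and all of its methods are hypothetical names made up for this example. It assumes a simplified subset font that hands out one-byte codes in order of first appearance, starting at 0x41; real PDF producers may assign codes differently.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of Tika or PDFBox.
class SubsetFontDemo {

    // Assumed subsetting scheme: each distinct character gets the next free
    // one-byte code, in order of first appearance, starting at 0x41 ('A').
    // The sketch only handles short texts with fewer than 191 distinct characters.
    static Map<Character, Integer> buildSubsetEncoding(String text) {
        Map<Character, Integer> encoding = new LinkedHashMap<>();
        int nextCode = 0x41;
        for (char c : text.toCharArray()) {
            if (!encoding.containsKey(c)) {
                encoding.put(c, nextCode++);
            }
        }
        return encoding;
    }

    // What ends up in the content stream: font-internal codes, not Unicode.
    static byte[] encode(String text, Map<Character, Integer> encoding) {
        byte[] codes = new byte[text.length()];
        for (int i = 0; i < text.length(); i++) {
            codes[i] = (byte) encoding.get(text.charAt(i)).intValue();
        }
        return codes;
    }

    // Extraction without any code-to-Unicode mapping: the codes are read as
    // if they were Latin-1 characters, which yields unrelated text.
    static String decodeNaively(byte[] codes) {
        return new String(codes, StandardCharsets.ISO_8859_1);
    }

    // Extraction when the mapping is known: invert it and look up each code.
    static String decodeWithMap(byte[] codes, Map<Character, Integer> encoding) {
        Map<Integer, Character> reverse = new HashMap<>();
        for (Map.Entry<Character, Integer> e : encoding.entrySet()) {
            reverse.put(e.getValue(), e.getKey());
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : codes) {
            sb.append(reverse.get(b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "\u0633\u0644\u0627\u0645"; // four distinct Arabic letters
        Map<Character, Integer> encoding = buildSubsetEncoding(original);
        byte[] codes = encode(original, encoding);
        System.out.println("without mapping: " + decodeNaively(codes));
        System.out.println("mapping recovers original: "
                + decodeWithMap(codes, encoding).equals(original));
    }
}
```

Without the mapping, the four Arabic letters come back as `ABCD`, which is the kind of gibberish seen here; with the mapping, the original string is recovered. When the PDF does not carry that mapping, an extractor has nothing to invert.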