[TIKA-722] Arabic PDF doesn't extract correctly - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: None
Component/s: parser
Labels: None

Description

I have a PDF with an Arabic font that Tika fails to extract (the output is
all gibberish).

It looks like the PDF does not include the separate Unicode text metadata
(hmm: would Tika extract that if it were present?), and copying and pasting
out of the PDF also produces gibberish.

To fix this, I think we'd somehow have to know the mapping for the
font (this particular font is AXTManal)?

Attachments

JUFO96.PDF | 19/Sep/11 17:24 | 190 kB | Uwe Schindler
metadata.png | 19/Sep/11 17:15 | 49 kB | Uwe Schindler
000279.pdf | 19/Sep/11 17:02 | 89 kB | Michael McCandless

Issue Links

is related to

TIKA-1337 LanguageProfile for Persian/Farsi (Resolved)

Activity

Uwe Schindler added a comment - 19/Sep/11 17:10

I don't think there is much we can do. Some PDF files (especially those created with e.g. LaTeX via dvips -> PDF; pdflatex mostly works fine) use internal, dynamically compressed fonts whose glyphs sit at completely different code positions. This often happens when the PDF creator used antique software or fonts that only know 256 code points (from the pre-Unicode era). In that case, the font file contains only the glyphs actually present in the text, packed into the available code points.

Text cannot be extracted from those PDFs; full-text extraction does not even work with Acrobat Reader. But they are still valid PDF files, as they are intended to be printed. This is like a PDF file that contains only a background TIFF image instead of text: the text cannot be extracted.
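
To illustrate the point about subset fonts, here is a minimal Python sketch of the idea, not of how Acrobat Distiller or Tika actually work. It assumes a hypothetical one-byte subset font that hands out codes in order of first use, starting at 0x21; all function names are made up for this example. Without the reverse map, an extractor can only guess at the meaning of the codes.

```python
# Hypothetical code range for the subset font (an assumption for this sketch).
FIRST_CODE = 0x21
LAST_CODE = 0xFF


def build_subset_encoding(text):
    """Assign each distinct character the next free one-byte code."""
    char_to_code = {}
    for ch in text:
        if ch not in char_to_code:
            code = FIRST_CODE + len(char_to_code)
            if code > LAST_CODE:
                raise ValueError("too many distinct glyphs for a one-byte font")
            char_to_code[ch] = code
    return char_to_code


def encode(text, char_to_code):
    """Produce the byte codes that would be stored in the page content."""
    return bytes(char_to_code[ch] for ch in text)


def naive_extract(codes):
    """No reverse map available: interpret the codes as Latin-1 text."""
    return codes.decode("latin-1")


def extract_with_map(codes, char_to_code):
    """With a reverse map (code -> character), the text is recoverable."""
    code_to_char = {code: ch for ch, code in char_to_code.items()}
    return "".join(code_to_char[b] for b in codes)


# Four distinct Arabic letters: seen, lam, alef, meem.
original = "\u0633\u0644\u0627\u0645"
encoding = build_subset_encoding(original)
stream = encode(original, encoding)

print(repr(naive_extract(stream)))                   # gibberish
print(extract_with_map(stream, encoding) == original)  # True
```

The page still prints correctly because the embedded glyph shapes are right; only the link from code to character is missing.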

Uwe Schindler added a comment - 19/Sep/11 17:15

I checked this file: that's exactly the type of file I am talking about. Here is the metadata, attached as a screenshot: Power Macintosh in 1999 with Acrobat Distiller 3.0, with only subsets of the fonts embedded. At that time, the Macintosh did not even know Unicode...

Uwe Schindler added a comment - 19/Sep/11 17:24

Here is a non-Persian example (which is actually a very, very early writeup of my own from 1996, from my personal archive; don't read it). If you try to copy and paste text out of it, you will see the same problem. It was also produced with Acrobat Distiller 3.0 with font subsets.

Michael McCandless added a comment - 19/Sep/11 18:03

Thanks Uwe; it sounds like there's not much we can do for such old PDFs.

Robert Muir added a comment - 03/Oct/11 17:07

Actually, in this case the original TTF font (AxtManal) is buggy.
The font actually uses glyph codes with a Unicode mapping (1-1 to their Unicode chars), but the glyph names are WRONG.

So Arabic glyphs in this font have misleading names like 'circumflex', which really confused
whatever produced this PDF when it embedded the font. You can see this if you open the original TTF
in FontForge; it gives tons of warnings:

'The glyph named circumflex is mapped to U+F0F6 But its name indicates it should be mapped to U+02C6'

It's not possible to open the embedded font in the PDF; it is reported as corrupted.
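
The kind of check behind that warning can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table of standard names is a tiny hand-picked excerpt, the font data is a hypothetical dict rather than a parsed TTF, and the message merely imitates the warning quoted above. It is not FontForge's implementation.

```python
# Small excerpt of standard glyph names and the code points they imply.
STANDARD_GLYPH_NAMES = {
    "space": 0x0020,
    "A": 0x0041,
    "circumflex": 0x02C6,
}


def find_name_mismatches(glyph_map):
    """Return warnings for glyphs whose name disagrees with their mapping.

    glyph_map: dict of glyph name -> code point the font maps it to.
    Names that are not in the standard table are skipped.
    """
    warnings = []
    for name, mapped in sorted(glyph_map.items()):
        expected = STANDARD_GLYPH_NAMES.get(name)
        if expected is not None and expected != mapped:
            warnings.append(
                f"The glyph named {name} is mapped to U+{mapped:04X} "
                f"but its name indicates it should be mapped to U+{expected:04X}"
            )
    return warnings


# Hypothetical font data: 'A' is consistent, 'circumflex' is not.
font_glyphs = {"A": 0x0041, "circumflex": 0xF0F6}

for warning in find_name_mismatches(font_glyphs):
    print(warning)
```

A tool that trusts glyph names when embedding such a font would record the wrong characters, which matches the confusion described in this comment.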

Michael McCandless added a comment - 03/Oct/11 17:15

OK resolving as Won't Fix.

I don't see how Tika can recover when the font itself is buggy... though it is tantalizing that the glyph IDs for this font are in fact Unicode code points.

I just hope there are not too many buggy fonts out there!
