  1. Tika
  2. TIKA-722

Arabic PDF doesn't extract correctly

Details

    • Bug
    • Status: Resolved
    • Minor
    • Resolution: Won't Fix
    • None
    • None
    • parser
    • None

    Description

      I have a PDF w/ Arabic font that Tika fails to extract (gets all
      gibberish).

      Looks like the PDF does not include the separate Unicode text metadata
      (hmm: would Tika extract that if it were present?), and copy/paste out
      of the PDF also produces gibberish.

      To fix this I think we'd somehow have to know the mapping for the
      font (this particular font is AXTManal)?
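
One way to notice this kind of failure automatically is to check whether the extracted text contains letters from the script you expect. The following is only a rough sketch using the JDK's `Character.UnicodeScript`; the class name `ScriptCheck`, the method `letterShare`, and the sample strings are made up for illustration and are not part of Tika.

```java
/** Hypothetical helper: measures how much of a text is written in a given script. */
final class ScriptCheck {

    /** Returns the fraction of letters that belong to the given script (0.0 if there are no letters). */
    static double letterShare(String text, Character.UnicodeScript script) {
        int letters = 0;
        int matching = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isLetter(cp)) {
                letters++;
                if (Character.UnicodeScript.of(cp) == script) {
                    matching++;
                }
            }
        }
        return letters == 0 ? 0.0 : (double) matching / letters;
    }

    public static void main(String[] args) {
        String arabic = "\u0633\u0644\u0627\u0645";   // four Arabic letters
        String gibberish = "\u00d3\u00e4\u00c7\u00e3"; // Latin-1 letters, as a broken extraction might yield
        System.out.println(letterShare(arabic, Character.UnicodeScript.ARABIC));
        System.out.println(letterShare(gibberish, Character.UnicodeScript.ARABIC));
    }
}
```

A low share for a document known to be Arabic only flags the problem; it does not recover the text.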

      Attachments

        1. JUFO96.PDF
          190 kB
          Uwe Schindler
        2. metadata.png
          49 kB
          Uwe Schindler
        3. 000279.pdf
          89 kB
          Michael McCandless

          Activity

            uschindler Uwe Schindler added a comment -

            I don't think there is much we can do. Some PDF files (especially those created by e.g. LaTeX via dvips -> pdf; pdflatex mostly works fine) use internal, dynamically compressed fonts that have their glyphs at totally different places. This is often done when the PDF creator uses antique software/fonts that only know 256 code points (pre-Unicode time). In that case, the font file only contains the glyphs actually present in the text, compressed into the available code points.

            Those PDFs are unparseable, and full-text extraction does not even work with Acrobat Reader. But they are still valid PDF files, as they are intended to be printed out. This is like a PDF file containing only a background TIFF image instead of text: the text cannot be extracted.
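
The effect described in this comment can be illustrated with a toy model. The sketch below is not how any real PDF producer or Tika works; it simply assumes a subset font that hands out codes (starting at `'A'`, an arbitrary choice) in order of first use, and shows that reading those codes as characters yields gibberish unless a reverse map is available, which is the role a ToUnicode table plays in PDFs that include one. All names (`SubsetFontDemo`, `buildSubsetEncoding`, and so on) are hypothetical, and the model only handles characters in the Basic Multilingual Plane.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of a subset font whose character codes are assigned in order of first use. */
final class SubsetFontDemo {

    /** First code handed out; an arbitrary assumption for this sketch. */
    static final int FIRST_CODE = 'A';

    /** Assigns consecutive codes to each distinct character in order of first appearance. */
    static Map<Character, Integer> buildSubsetEncoding(String text) {
        Map<Character, Integer> codes = new LinkedHashMap<>();
        for (char c : text.toCharArray()) {
            codes.putIfAbsent(c, FIRST_CODE + codes.size());
        }
        return codes;
    }

    /** Encodes the text as the code sequence that the page content would carry. */
    static int[] encode(String text, Map<Character, Integer> codes) {
        int[] stream = new int[text.length()];
        for (int i = 0; i < text.length(); i++) {
            stream[i] = codes.get(text.charAt(i));
        }
        return stream;
    }

    /** Naive extraction: treats every code as if it were a Unicode code point. */
    static String extractNaively(int[] stream) {
        StringBuilder sb = new StringBuilder();
        for (int code : stream) {
            sb.appendCodePoint(code);
        }
        return sb.toString();
    }

    /** Extraction with a reverse (code -> character) map. */
    static String extractWithMap(int[] stream, Map<Character, Integer> codes) {
        Map<Integer, Character> reverse = new HashMap<>();
        codes.forEach((ch, code) -> reverse.put(code, ch));
        StringBuilder sb = new StringBuilder();
        for (int code : stream) {
            sb.append(reverse.get(code).charValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "\u0633\u0644\u0627\u0645"; // four Arabic letters
        Map<Character, Integer> codes = buildSubsetEncoding(original);
        int[] stream = encode(original, codes);
        System.out.println(extractNaively(stream));        // Latin letters, not Arabic
        System.out.println(extractWithMap(stream, codes)); // the original text
    }
}
```

Without the reverse map, nothing in the code sequence itself says which character each code stood for, which is why copy/paste fails in viewers as well.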

            uschindler Uwe Schindler added a comment -

            I checked this file: that's exactly the type of file I am talking about. Here is the metadata, attached as a screenshot: Power Macintosh in 1999 with Acrobat Distiller 3.0, embedding only subsets of the fonts. At that time, the Macintosh did not even know Unicode...

            uschindler Uwe Schindler added a comment -

            Here is a non-Persian example (which is actually a very, very early writeup of my own, back in 1996, from my personal archive - don't read it). If you try to copy/paste text out of it, you will see the same problem. It's also Acrobat Distiller 3.0 with font subsets.


            mikemccand Michael McCandless added a comment -

            Thanks Uwe; it sounds like there's not much we can do for such old PDFs.

            rcmuir Robert Muir added a comment -

            Actually, in this case the original TTF font (AxtManal) is buggy.
            The font actually uses glyph codes with a Unicode mapping (1-1 to their Unicode chars), but the names are WRONG.

            So Arabic glyphs in this font have misleading names like 'circumflex', causing
            whatever produced this PDF to be really confused when it embedded the font. You can see this if you open the original TTF
            in fontforge; it gives tons of warnings:

            'The glyph named circumflex is mapped to U+F0F6 But its name indicates it should be mapped to U+02C6'

            It's not possible to open the embedded font in the PDF; it claims it's corrupted.
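
The kind of consistency check behind that warning can be sketched as follows. This is not fontforge's implementation; it is a minimal illustration that compares each glyph's mapped code point with the code point its name conventionally implies. The lookup table here is a tiny hand-written stand-in (only `circumflex` -> U+02C6 comes from the warning quoted above), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical check for glyph names that disagree with their mapped code points. */
final class GlyphNameCheck {

    /** Code point implied by each glyph name; a tiny illustrative table, not a full glyph list. */
    static final Map<String, Integer> NAME_TO_CODE_POINT = Map.of(
            "circumflex", 0x02C6,
            "A", 0x0041,
            "space", 0x0020);

    /** Returns one warning per glyph whose mapped code point differs from the one its name implies. */
    static List<String> findMismatches(Map<String, Integer> fontMapping) {
        List<String> warnings = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : fontMapping.entrySet()) {
            Integer expected = NAME_TO_CODE_POINT.get(entry.getKey());
            // Names missing from the table are skipped: nothing to compare against.
            if (expected != null && !expected.equals(entry.getValue())) {
                warnings.add(String.format(
                        "glyph '%s' is mapped to U+%04X but its name implies U+%04X",
                        entry.getKey(), entry.getValue(), expected));
            }
        }
        return warnings;
    }

    public static void main(String[] args) {
        // Made-up font mapping: 'circumflex' mirrors the warning above, 'A' is consistent.
        Map<String, Integer> font = new LinkedHashMap<>();
        font.put("circumflex", 0xF0F6);
        font.put("A", 0x0041);
        findMismatches(font).forEach(System.out::println);
    }
}
```

A tool that trusts glyph names over the mapping would read such a glyph as a circumflex accent rather than the Arabic letter it actually draws.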


            mikemccand Michael McCandless added a comment -

            OK, resolving as Won't Fix.

            I don't see how Tika can recover when the font itself is buggy... though it is tantalizing that the glyph IDs for this font are in fact Unicode code points.

            I just hope there are not too many buggy fonts out there!

            People

              Assignee: Unassigned
              Reporter: Michael McCandless