  1. PDFBox
  2. PDFBOX-192

Find encodings in FontFile3 - Compact Font Format

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.1
    • Component/s: Text extraction
    • Labels:
      None

      Description

      [imported from SourceForge]
      http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=1545266
      Originally submitted by zlaya_buka on 2006-08-23 06:04.

      Problem finding the encoding

      Debugging a page from the set (uploaded: 01_.pdf)
      showed:

      • all the fonts have the same subtype, Type 1
      • none of the fonts has a CMap
      • the encoding dictionaries of all the fonts are practically
        useless: each one contains only a Differences array
        (with a single mapping, for a code that does not seem
        to be used on the page)
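
      For context, a Differences array only overrides individual codes of a
      base encoding, so a single entry leaves every other code unchanged. The
      following is a minimal sketch of that overlay, not PDFBox code: the class
      name, the method name, and the sample glyph name are illustrative
      assumptions, and the array is modelled as a plain list of integers and
      strings.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch; it does not use or mirror the PDFBox API.
public class DifferencesSketch {

    /**
     * Overlays a Differences array on a base encoding.
     * An Integer sets the current code; each following String (glyph name)
     * is assigned to the current code, which is then incremented.
     */
    public static Map<Integer, String> applyDifferences(
            Map<Integer, String> base, List<Object> differences) {
        Map<Integer, String> result = new HashMap<>(base);
        int code = -1; // no code seen yet
        for (Object item : differences) {
            if (item instanceof Integer) {
                code = (Integer) item;
            } else if (item instanceof String && code >= 0) {
                result.put(code, (String) item);
                code++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, String> base = new HashMap<>();
        base.put(65, "A");
        base.put(66, "B");
        // Illustrative array with one override; the glyph name is an assumption.
        List<Object> differences = Arrays.<Object>asList(66, "afii10017");
        Map<Integer, String> merged = applyDifferences(base, differences);
        // Code 65 keeps its base name; only code 66 is replaced.
        System.out.println(merged.get(65) + " " + merged.get(66));
    }
}
```

      With only one override per font, as on this page, nearly every code
      still resolves through the base encoding.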

      From the source I found that in such a case PDFBox
      tries to read the encoding information directly from the font:

      COSStream fontFile = (COSStream)
          fontDescriptor.getDictionaryObject(COSName.FONT_FILE);
      if( fontFile != null )
      {
          BufferedReader in = new BufferedReader(
              new InputStreamReader(fontFile.getUnfilteredStream()));
          /*
           * This section parses the font program stream, searching
           * for an /Encoding entry. The search stops when the entry
           * "currentdict end" is reached or after 100 lines.
           */
          ...
      }
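
      The scan described in that comment can be sketched roughly as follows.
      This is not the PDFBox implementation: the class name, the regular
      expression, and the rule for where the encoding block ends are
      assumptions. It assumes the clear-text part of a Type 1 font program
      lists its encoding as lines of the form "dup 65 /A put".

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the line scan; names and details are assumptions.
public class Type1EncodingScan {

    private static final int MAX_LINES = 100;
    // Matches entries such as "dup 65 /A put".
    private static final Pattern DUP_PUT =
            Pattern.compile("dup\\s+(\\d+)\\s*/(\\S+)\\s+put");

    /** Returns the code-to-glyph-name pairs found in the /Encoding block. */
    public static Map<Integer, String> scan(BufferedReader in) throws IOException {
        Map<Integer, String> encoding = new LinkedHashMap<>();
        boolean inEncoding = false;
        int count = 0;
        String line;
        while (count < MAX_LINES && (line = in.readLine()) != null) {
            count++;
            if (line.contains("currentdict end")) {
                break; // end of the clear-text font dictionary
            }
            if (line.contains("/Encoding")) {
                inEncoding = true;
            }
            if (inEncoding) {
                Matcher m = DUP_PUT.matcher(line);
                while (m.find()) {
                    encoding.put(Integer.parseInt(m.group(1)), m.group(2));
                }
                // Assumption: a line ending in "def" closes the encoding block.
                if (line.trim().endsWith("def")) {
                    inEncoding = false;
                }
            }
        }
        return encoding;
    }

    public static void main(String[] args) throws IOException {
        String sample = "/Encoding 256 array\n"
                + "0 1 255 {1 index exch /.notdef put} for\n"
                + "dup 65 /A put\n"
                + "dup 66 /B put\n"
                + "readonly def\n"
                + "currentdict end\n";
        System.out.println(scan(new BufferedReader(new StringReader(sample))));
    }
}
```

      A scan like this only works on the textual Type 1 format, so it cannot
      be applied to a binary font program as it stands.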

      The problem is that all the fonts on my page are
      marked in their font descriptors as FontFile3, i.e. they
      are in Compact Font Format. From the code above it seems
      that PDFBox parses only COSName.FONT_FILE and ignores
      COSName.FONT_FILE3. So in the end I get StandardEncoding
      for all the characters, which is wrong, since all
      the pages are in Russian.
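
      One way to picture the missing fallback is a lookup that tries each
      font program key in turn instead of stopping at FontFile. The sketch
      below is only an illustration under stated assumptions: the font
      descriptor is modelled as a plain Map rather than a PDFBox dictionary,
      and the class and method names are hypothetical. The key names
      FontFile, FontFile2 and FontFile3 are the ones the PDF format defines
      for embedded font programs.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; a Map stands in for the font descriptor dictionary.
public class FontProgramLookup {

    // Font program keys, tried in this order.
    private static final List<String> KEYS =
            Arrays.asList("FontFile", "FontFile2", "FontFile3");

    /** Returns the first font program key present in the descriptor, if any. */
    public static Optional<String> findFontProgramKey(Map<String, byte[]> descriptor) {
        for (String key : KEYS) {
            if (descriptor.get(key) != null) {
                return Optional.of(key);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, byte[]> descriptor = new HashMap<>();
        // Illustrative placeholder bytes, not a real font program.
        descriptor.put("FontFile3", new byte[] {1, 0, 4, 1});
        System.out.println(findFontProgramKey(descriptor).orElse("none"));
    }
}
```

      Finding the key is only the first step: a FontFile3 stream holds
      binary data, so it needs its own parser rather than a line-based scan.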

      Is there any chance of finding a way to extract the
      encoding from the compact font? In my case it seems to be
      the only place where this information can be found, since
      Acrobat displays all the files correctly
      (TextStripper returns mostly spaces and garbage).

      [attachment on SourceForge]
      http://sourceforge.net/tracker/download.php?group_id=78314&atid=552835&aid=1545266&file_id=190290
      01.zip (application/x-zip-compressed), 196513 bytes
      Sample file (first page of newspaper, rus) + font report (PDFLib font reporter)

        Attachments

        1. PDFBOX191-01_1.png
          1.29 MB
          Andreas Lehmkühler
        2. PDFBOX191-01_.txt
          11 kB
          Andreas Lehmkühler
        3. PDFBOX191-01_.pdf
          134 kB
          Andreas Lehmkühler

              People

              • Assignee:
                Unassigned
                Reporter:
                Anonymous