[PDFBOX-192] Find encodings in FontFile3 - CompactFont Format - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.3.1
Component/s: Text extraction
Labels: None

Description

[imported from SourceForge]
http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=1545266
Originally submitted by zlaya_buka on 2006-08-23 06:04.

Finding encoding problem

Debugging a page from the set (uploaded: 01_.pdf)
showed:

- all the fonts are of the same subtype, Type 1
- none of the fonts has a CMap
- the encoding dictionaries of all fonts are practically
  useless: each font's encoding entry contains only a
  Differences array, with just one mapping for a code
  that does not seem to be used on the page

From the source I found that in such a case PDFBox
tries to read the encoding directly from the font program:

COSStream fontFile = (COSStream)
        fontDescriptor.getDictionaryObject(COSName.FONT_FILE);
if( fontFile != null )
{
    BufferedReader in = new BufferedReader(
            new InputStreamReader(fontFile.getUnfilteredStream()));
    /**
     * this section parses the FileProgram stream searching
     * for a /Encoding entry
     * the search stops if the entry "currentdict end" is
     * reached or after 100 lines
     */
    ...
}
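
To make the elided scan concrete, here is a minimal sketch of what such a loop could look like. It is not the PDFBox code: the class name Type1EncodingScan, the regular expression, and the sample input are assumptions for illustration, and it only handles the common "dup <code> /<glyphname> put" form of a Type 1 /Encoding array.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
final class Type1EncodingScan {
    // Matches entries such as "dup 192 /afii10017 put".
    private static final Pattern ENTRY =
            Pattern.compile("\\bdup\\s+(\\d+)\\s*/([^\\s/]+)\\s+put");
    // Same limit as described in the quoted comment.
    private static final int MAX_LINES = 100;

    static Map<Integer, String> scan(BufferedReader in) {
        Map<Integer, String> codeToName = new LinkedHashMap<>();
        boolean inEncoding = false;
        try {
            String line;
            for (int n = 0; n < MAX_LINES && (line = in.readLine()) != null; n++) {
                // Stop at the end of the clear-text font dictionary.
                if (line.contains("currentdict end")) {
                    break;
                }
                if (line.contains("/Encoding")) {
                    inEncoding = true;
                }
                if (!inEncoding) {
                    continue;
                }
                // A line may carry several entries, so keep matching.
                Matcher m = ENTRY.matcher(line);
                while (m.find()) {
                    codeToName.put(Integer.parseInt(m.group(1)), m.group(2));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return codeToName;
    }

    public static void main(String[] args) {
        // Made-up font program fragment, for illustration only.
        String sample = "%!PS-AdobeFont-1.0: Demo 001.000\n"
                + "/Encoding 256 array\n"
                + "0 1 255 {1 index exch /.notdef put} for\n"
                + "dup 192 /afii10017 put\n"
                + "dup 193 /afii10018 put\n"
                + "readonly def\n"
                + "currentdict end\n";
        System.out.println(scan(new BufferedReader(new StringReader(sample))));
    }
}
```

A scan like this only works on the clear-text PostScript header of a Type 1 font program; it has nothing to match in binary font data.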

The problem is that every font on my page has its font
program stored under FontFile3 in its font descriptor,
i.e. in Compact Font Format (CFF). Judging from the code
above, PDFBox reads only COSName.FONT_FILE and ignores
COSName.FONT_FILE3. As a result I end up with
StandardEncoding for every character, which is wrong
because all the pages are in Russian.
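
As a rough illustration of the missing fallback, the sketch below looks for an embedded font program under all three font descriptor keys defined by the PDF format (FontFile for Type 1, FontFile2 for TrueType, FontFile3 for CFF and related formats) and checks whether the data starts with a plausible CFF header. It deliberately avoids the PDFBox API: the plain Map stands in for the font descriptor dictionary, and FontProgramLookup and both method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper; a plain Map stands in for the font descriptor.
final class FontProgramLookup {
    // Font descriptor keys that can hold an embedded font program.
    private static final List<String> KEYS =
            Arrays.asList("FontFile", "FontFile2", "FontFile3");

    // Returns the first key that is present, instead of checking FontFile only.
    static Optional<String> findEmbeddedProgramKey(Map<String, ?> fontDescriptor) {
        for (String key : KEYS) {
            if (fontDescriptor.get(key) != null) {
                return Optional.of(key);
            }
        }
        return Optional.empty();
    }

    // A CFF header is four bytes: major, minor, hdrSize, offSize.
    // Type 1 clear text starts with "%!" instead, so it fails this check.
    static boolean looksLikeCff(byte[] data) {
        if (data == null || data.length < 4) {
            return false;
        }
        int major = data[0] & 0xFF;
        int hdrSize = data[2] & 0xFF;
        int offSize = data[3] & 0xFF;
        return major == 1 && hdrSize >= 4 && offSize >= 1 && offSize <= 4;
    }
}
```

Because CFF data is binary, a line-based text scan cannot be reused for it; recovering the encoding would need a real CFF parser, which this sketch does not attempt.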

Is there any chance of a solution for extracting the
encoding from a compact font? In my case it seems to be
the only place where this information can be found:
Acrobat displays all the files correctly, while
TextStripper returns mostly spaces and garbage.

[attachment on SourceForge]
http://sourceforge.net/tracker/download.php?group_id=78314&atid=552835&aid=1545266&file_id=190290
01.zip (application/x-zip-compressed), 196513 bytes
Sample file (first page of a Russian newspaper) plus a font report (PDFLib font reporter)

Attachments

PDFBOX191-01_1.png, 02/Oct/10 15:44, 1.29 MB, Andreas Lehmkühler
PDFBOX191-01_.txt, 02/Oct/10 15:44, 11 kB, Andreas Lehmkühler
PDFBOX191-01_.pdf, 02/Oct/10 15:44, 134 kB, Andreas Lehmkühler

Issue Links

is related to

PDFBOX-2220 [PATCH] Differences array without BaseEncoding (Type1C)

Closed

People

Assignee: Unassigned
Reporter: Anonymous
Votes: 0
Watchers: 0

Dates

Created: 23/Aug/06 13:04
Updated: 19/Jul/14 14:05
Resolved: 02/Oct/10 15:46