Details
- New Feature
- Status: Closed
- Minor
- Resolution: Fixed
Description
[imported from SourceForge]
http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=1545266
Originally submitted by zlaya_buka on 2006-08-23 06:04.
Finding encoding problem
Debugging a page from the set (uploaded: 01_.pdf) showed:
- all the fonts are of the same subtype, Type 1
- there are no CMaps for any of the fonts
- the encoding dictionaries for all fonts are practically useless: each font's encoding entry contains only a Differences array (with just one mapping, for a code that does not seem to be used on the page)
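To illustrate why a Differences array with a single mapping leaves almost everything at the base encoding, here is a minimal sketch of how such an array is applied. The class DifferencesSketch and its method are hypothetical (they are not PDFBox classes), the array is modelled as a plain List of Integer and String entries, and the glyph names in the usage note are only illustrative.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of PDFBox: overlays a PDF /Differences array
// on top of a base code -> glyph name table.
final class DifferencesSketch {

    /**
     * @param base        code -> glyph name table of the base encoding
     * @param differences entries in PDF order: an Integer starts a run at that
     *                    code, each following String names the next code's glyph
     * @return a new table; codes not mentioned keep their base mapping
     */
    static Map<Integer, String> apply(Map<Integer, String> base, List<Object> differences) {
        Map<Integer, String> result = new TreeMap<>(base);
        int code = -1; // no run has been started yet
        for (Object entry : differences) {
            if (entry instanceof Integer) {
                code = (Integer) entry;           // start a new run at this code
            } else if (entry instanceof String && code >= 0) {
                result.put(code, (String) entry); // override this code
                code++;                           // next name goes to the next code
            }
        }
        return result;
    }
}
```

For example, applying the list [66, "afii10017", "afii10018"] to a base table containing 65=A and 66=B overrides code 66, adds code 67, and leaves code 65 untouched. With only one mapping in the array, every other code on the page still resolves through the base encoding.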
I discovered from the source that in such a case
PDFBox tries to read the encoding information directly from the font:
COSStream fontFile = (COSStream) fontDescriptor.getDictionaryObject(COSName.FONT_FILE);
if( fontFile != null )
{
    BufferedReader in = new BufferedReader(new InputStreamReader(fontFile.getUnfilteredStream()));
    /**
     * This section parses the FileProgram stream searching for an /Encoding entry.
     * The search stops if the entry "currentdict end" is reached or after 100 lines.
     */
    ...
}
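As far as I understand the comment above, the elided part scans the clear-text portion of the Type 1 font program line by line. Below is a minimal sketch of that kind of scan, not the actual PDFBox code: the class name Type1EncodingScanSketch is hypothetical, and it assumes the encoding entries are written one per line in the usual "dup <code> /<name> put" form.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not the PDFBox implementation.
final class Type1EncodingScanSketch {

    static final int MAX_LINES = 100; // same limit as mentioned in the quoted comment

    /** Collects code -> glyph name pairs found after an /Encoding line. */
    static Map<Integer, String> scan(Reader fontProgram) {
        Map<Integer, String> encoding = new TreeMap<>();
        try (BufferedReader in = new BufferedReader(fontProgram)) {
            boolean inEncoding = false;
            int count = 0;
            String line;
            while (count < MAX_LINES && (line = in.readLine()) != null) {
                count++;
                String trimmed = line.trim();
                if (trimmed.startsWith("currentdict end")) {
                    break;                       // end of the font dictionary
                }
                if (trimmed.startsWith("/Encoding")) {
                    inEncoding = true;           // encoding entries follow
                } else if (trimmed.startsWith("readonly def")) {
                    inEncoding = false;          // encoding array is finished
                } else if (inEncoding && trimmed.startsWith("dup ")) {
                    // assumed shape: dup <code> /<name> put
                    String[] parts = trimmed.split("\\s+");
                    if (parts.length >= 4 && parts[2].startsWith("/") && "put".equals(parts[3])) {
                        try {
                            encoding.put(Integer.parseInt(parts[1]), parts[2].substring(1));
                        } catch (NumberFormatException e) {
                            // not a numeric code, ignore this line
                        }
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return encoding;
    }
}
```

A scan like this only makes sense for a plain-text Type 1 program. A FontFile3 stream in Compact Font Format is binary, so it would need a different parser.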
The problem is that all the fonts on my page are
marked in their font descriptors as FontFile3, i.e. they are
in Compact Font Format. From the code above it seems
that PDFBox parses only COSName.FONT_FILE and ignores
COSName.FONT_FILE3. So in the end I get StandardEncoding
for all the characters, which is wrong, since all
the pages are in Russian.
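What I would have expected is a lookup that also considers the other font program keys. The following is only a sketch of that idea with the font descriptor modelled as a plain Map. FontProgramLookupSketch is a hypothetical name, and the real code would of course work on the COS objects instead.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the font descriptor is modelled as a plain Map
// instead of a PDFBox COS dictionary.
final class FontProgramLookupSketch {

    // Font descriptor keys under which the PDF format allows an embedded font program.
    static final List<String> KEYS = Arrays.asList("FontFile", "FontFile2", "FontFile3");

    /** Returns the first key that holds an embedded program, or null if there is none. */
    static String findEmbeddedProgramKey(Map<String, ?> fontDescriptor) {
        for (String key : KEYS) {
            if (fontDescriptor.get(key) != null) {
                return key;
            }
        }
        return null;
    }
}
```

For the fonts on my page such a lookup would return "FontFile3", at which point the stream would have to be handed to a Compact Font Format parser rather than to the text scan used for FontFile.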
Is there any chance of finding a way to extract
the encoding from a compact font? In my case it seems
to be the only place where this information can be
found, since Acrobat displays all the files correctly
(TextStripper returns mostly spaces and garbage).
[attachment on SourceForge]
http://sourceforge.net/tracker/download.php?group_id=78314&atid=552835&aid=1545266&file_id=190290
01.zip (application/x-zip-compressed), 196513 bytes
Sample file (first page of newspaper, rus) + font report (PDFLib font reporter)
Issue Links
- is related to PDFBOX-2220 [PATCH] Differences array without BaseEncoding (Type1C) (Closed)