[PDFBOX-833] Wrong encoding with Type1C font when specific encoding is defined - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.3.1
Fix Version/s: 2.0.0
Component/s: Parsing
Labels: None

Description

The Type1C font implementation overrides the encode() method of the PDFont base class. As a result, codes are mapped to characters using the encoding defined in the font itself.
However, if an encoding is given explicitly (such as WinAnsiEncoding), this produces wrong results whenever the encoding's codes do not match the font's glyph codes.
In a test document (an article from Elsevier, which unfortunately I cannot make public), an embedded Type1C font defines a copyright sign at glyph position 259. The encoding is defined as WinAnsiEncoding, and the text characters are coded according to WinAnsiEncoding. The copyright sign is therefore coded as 0xa9 (169), a position where the font has the glyph 'quotesingle' defined.
Since I currently have no other test cases, I implemented the following workaround for WinAnsiEncoding (which might be extended to other PDF encodings as well).
In PDType1CFont.encode() I start with:

if (getEncoding() instanceof WinAnsiEncoding) {
    // an explicit WinAnsiEncoding is set: use the PDFont base-class encoding
    return super.encode(bytes, offset, length);
}

This resolves the encoding problems for text extraction.
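
The mismatch described above can be illustrated with a minimal, self-contained sketch. It does not use PDFBox: the class EncodingLookupSketch, the method glyphName, and both tables are hypothetical stand-ins. The built-in table holds only the single entry mentioned in this report, and the WinAnsiEncoding table is a two-entry excerpt. Returning ".notdef" for unmapped codes is a simplifying assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in PDFBox.
class EncodingLookupSketch {
    // Assumed excerpt of the font's built-in code-to-glyph table
    // (per the report, the font has 'quotesingle' at code 0xA9).
    static final Map<Integer, String> FONT_BUILT_IN = new HashMap<>();
    // Excerpt of WinAnsiEncoding: 0xA9 is 'copyright', 0x27 is 'quotesingle'.
    static final Map<Integer, String> WIN_ANSI = new HashMap<>();

    static {
        FONT_BUILT_IN.put(0xA9, "quotesingle");
        WIN_ANSI.put(0xA9, "copyright");
        WIN_ANSI.put(0x27, "quotesingle");
    }

    // Resolve a character code to a glyph name. If an explicit encoding is
    // given, it takes precedence over the font's built-in table.
    static String glyphName(int code, Map<Integer, String> explicitEncoding) {
        if (explicitEncoding != null) {
            return explicitEncoding.getOrDefault(code, ".notdef");
        }
        return FONT_BUILT_IN.getOrDefault(code, ".notdef");
    }

    public static void main(String[] args) {
        // Built-in table only: the behaviour reported as wrong.
        System.out.println(glyphName(0xA9, null));     // prints "quotesingle"
        // Explicit encoding honoured: the behaviour after the workaround.
        System.out.println(glyphName(0xA9, WIN_ANSI)); // prints "copyright"
    }
}
```

With only the built-in table, code 0xA9 resolves to the wrong glyph name. Once the explicit encoding is consulted first, the same code resolves to the copyright sign, which is the effect the workaround aims for.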

Attachments

pdfbox-833.patch (11 kB), 14/Apr/13 22:21, Luis Bernardo
sample.pdf (313 kB), 14/Apr/13 22:21, Luis Bernardo
sample1-fixed.png (162 kB), 14/Apr/13 22:21, Luis Bernardo
sample1-original.png (157 kB), 14/Apr/13 22:21, Luis Bernardo
simpleh2.pdf (12 kB), 17/May/13 08:03, Simon Steiner

Issue Links

is duplicated by: PDFBOX-1506 Incorrect visualization of PDF document via PageDrawer (Closed)

People

Assignee: Andreas Lehmkühler

Reporter: Timo Boehme

Votes: 1

Watchers: 3

Dates

Created: 20/Sep/10 13:39

Updated: 17/Mar/16 19:08

Resolved: 11/Aug/13 15:24