  1. PDFBox
  2. PDFBOX-833

Wrong encoding with Type1C font when specific encoding is defined


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.1
    • Fix Version/s: 2.0.0
    • Component/s: Parsing
    • Labels: None

    Description

      The Type1C font implementation overrides the encode() method of the PDFont base class. As a result, codes are mapped to characters using the lookup defined in the font itself.
      However, if an encoding is given explicitly (such as WinAnsiEncoding), this leads to wrong results whenever the encoding's codes do not match the font's glyph codes.
      In a test document (which unfortunately I cannot make public; it is an article from Elsevier), an embedded Type1C font defines a copyright sign at glyph position 259. The encoding is defined as WinAnsiEncoding, and the text characters are encoded accordingly. The copyright sign therefore has code 0xa9 (169), a position where the font has the glyph 'quotesingle' defined.
      Since I currently have no other test cases, I implemented the following workaround for WinAnsiEncoding (which might be extended to other PDF encodings as well).
      In PDType1CFont.encode() I start with:

      if ( getEncoding() instanceof WinAnsiEncoding )
      {
          // use PDFont encoding
          return super.encode( bytes, offset, length );
      }

      This resolves the encoding problems for text extraction.
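
      The mismatch can be illustrated with a small standalone Java sketch. It is not PDFBox code: the class EncodingLookupSketch, the two maps, and the glyphName() helper are hypothetical names, and the tables hold only the codes mentioned in this report plus one extra WinAnsi entry. It assumes that an explicitly given encoding should take precedence over the font's built-in table, which is the behaviour the workaround aims for.

```java
import java.util.HashMap;
import java.util.Map;

public class EncodingLookupSketch {
    // Hypothetical excerpt of the font's built-in code-to-glyph table.
    // Per the report, the font has 'quotesingle' at code 0xA9.
    static final Map<Integer, String> FONT_BUILT_IN = new HashMap<>();

    // Excerpt of WinAnsiEncoding: code 0xA9 is 'copyright', 0x41 is 'A'.
    static final Map<Integer, String> WIN_ANSI = new HashMap<>();

    static {
        FONT_BUILT_IN.put(0xA9, "quotesingle");
        WIN_ANSI.put(0xA9, "copyright");
        WIN_ANSI.put(0x41, "A");
    }

    // Resolve a character code to a glyph name.
    // Assumption: an explicitly given encoding wins over the font's own table;
    // the built-in table is consulted only when no explicit encoding is set.
    static String glyphName(int code, Map<Integer, String> explicitEncoding) {
        if (explicitEncoding != null) {
            return explicitEncoding.getOrDefault(code, ".notdef");
        }
        return FONT_BUILT_IN.getOrDefault(code, ".notdef");
    }

    public static void main(String[] args) {
        // Font-only lookup, as in the reported behaviour.
        System.out.println(glyphName(0xA9, null));
        // Lookup that honours the explicit WinAnsiEncoding.
        System.out.println(glyphName(0xA9, WIN_ANSI));
    }
}
```

      With these tables, the font-only lookup yields 'quotesingle' for 0xa9, while the lookup that honours WinAnsiEncoding yields 'copyright', which is what the text extraction in the test document needs.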

      Attachments

        1. pdfbox-833.patch
          11 kB
          Luis Bernardo
        2. sample.pdf
          313 kB
          Luis Bernardo
        3. sample1-fixed.png
          162 kB
          Luis Bernardo
        4. sample1-original.png
          157 kB
          Luis Bernardo
        5. simpleh2.pdf
          12 kB
          Simon Steiner

            People

              Assignee: lehmi Andreas Lehmkühler
              Reporter: tboehme Timo Boehme
              Votes: 1
              Watchers: 3
