PDFBox / PDFBOX-5387

ToUnicodeWriter.writeTo allows byte overflow in bfrange operator

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.25
    • Fix Version/s: 2.0.26, 3.0.0 PDFBox
    • Component/s: PDModel
    • Labels: None

    Description

      The writeTo method of ToUnicodeWriter allows the low-order byte to overflow when writing beginbfrange/endbfrange entries.

      As far as I can tell, it is used only by the PDCIDFontType2Embedder class. I believe the bug exists in both the main trunk and the 2.x branch. The code in question may be found here.

      The portion of the PDF specification (version 1.7) that bears upon this code is Section 5.9, Example 5.16.

      The existing code attempts to limit each range to a span of at most 255 code points, but it does not account for at least the following situation, which it currently allows (for example):

      [srcCode1 srcCode2 dstString]
      03FF 0400 0036

      The overflow between srcCode1 and srcCode2 (the low-order byte wraps from FF to 00, so the high-order byte changes from 03 to 04) is not allowed by the specification, and any text extraction will fail. The glyphs themselves render fine, so the problem is not obvious until one examines the text in the Content Panel or copies and pastes from Acrobat (Pro) into another document. By contrast, the following bfrange entry does allow text extraction to work as intended:

      [srcCode1 srcCode2 dstString]
      03FE 03FF 0035

      Notice that no overflow exists, and as such the requirements of the specification are met.
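
      The difference between the two entries above comes down to whether srcCode1 and srcCode2 share the same high-order byte. A minimal sketch of that check follows, assuming 2-byte source codes as in the examples; BfRangeCheck and crossesHighByteBoundary are hypothetical names used for illustration and are not part of PDFBox.

```java
// Hypothetical helper, not PDFBox API: flags a bfrange whose source codes
// do not share the same high-order byte (assumes 2-byte codes, 0x0000-0xFFFF).
final class BfRangeCheck {
    private BfRangeCheck() {
    }

    static boolean crossesHighByteBoundary(int srcCode1, int srcCode2) {
        // Everything above the low-order byte must be identical for both ends.
        return (srcCode1 >>> 8) != (srcCode2 >>> 8);
    }
}
```

      With this check, the pair 03FF 0400 is flagged because its high-order bytes are 03 and 04, while 03FE 03FF is accepted because both are 03.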

      I have put together a proposed solution here in my fork of the PDFBox GH mirror.
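
      The linked patch is not reproduced here. Purely as an illustration of one possible approach, and not necessarily what the patch or the eventual fix does, a run of consecutive source codes could be split wherever the low-order byte would wrap, so that every emitted range keeps a single high-order byte. BfRangeSplitter and split are hypothetical names, and 2-byte codes are assumed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not PDFBox API: splits a run of consecutive 2-byte
// source codes so that no sub-range crosses a low-order-byte boundary.
final class BfRangeSplitter {
    private BfRangeSplitter() {
    }

    /** Returns {start, end} pairs (inclusive) covering first..last. */
    static List<int[]> split(int first, int last) {
        if (first < 0 || last > 0xFFFF || first > last) {
            throw new IllegalArgumentException("expected 0 <= first <= last <= 0xFFFF");
        }
        List<int[]> ranges = new ArrayList<>();
        int start = first;
        while (start <= last) {
            // (start | 0xFF) is the last code sharing start's high-order byte.
            int end = Math.min(last, start | 0xFF);
            ranges.add(new int[] { start, end });
            start = end + 1;
        }
        return ranges;
    }
}
```

      Under these assumptions, split(0x03FE, 0x0401) yields the two ranges 03FE-03FF and 0400-0401 instead of a single range that spans the 03FF/0400 boundary. A real writer would also have to advance the destination string by the same offset for each new sub-range.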

       

          People

            Assignee: Andreas Lehmkühler (lehmi)
            Reporter: Ryan Jackson (ryan.jackson)