PDFBox / PDFBOX-2035

Ignore badly formatted toUnicode CMaps

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.4, 2.0.0
    • Fix Version/s: 1.8.5, 2.0.0
    • Component/s: Parsing, PDModel
    • Labels: None

    Description

      Copied from PDFBOX-399:

      Submitting a patch for ignoring badly-formatted CMap ToUnicode instructions.
      It lets the parser handle some ToUnicode resource streams that previously threw exceptions, which were silently swallowed, so text extraction now gets the correctly mapped characters.

      Specifically, the patch parses a token immediately followed by a hex string (token<hex>) with no whitespace separating them, skips all whitespace inside a hex value, and returns a partially constructed CMap instead of throwing an exception.
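
      The three behaviours described here can be illustrated with a small standalone sketch. This is not the attached patch and does not use PDFBox's CMapParser; the class LenientToUnicodeSketch and its methods are hypothetical names invented for illustration, and the sketch only understands beginbfchar/endbfchar blocks with UTF-16BE destinations.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not PDFBox code and not the attached patch.
final class LenientToUnicodeSketch {

    /** Minimal tokenizer: hex tokens come back as "<" + digits, everything else as plain words. */
    static final class Tokenizer {
        private final String src;
        private int pos = 0;

        Tokenizer(String src) { this.src = src; }

        /** Returns the next token, or null at end of input. Throws on malformed hex. */
        String next() {
            while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) {
                pos++;
            }
            if (pos >= src.length()) {
                return null;
            }
            if (src.charAt(pos) == '<') {
                StringBuilder hex = new StringBuilder("<");
                pos++;
                while (pos < src.length() && src.charAt(pos) != '>') {
                    char c = src.charAt(pos++);
                    if (Character.isWhitespace(c)) {
                        continue; // eat whitespace inside a hex value
                    }
                    if (Character.digit(c, 16) < 0) {
                        throw new IllegalArgumentException("bad hex digit: " + c);
                    }
                    hex.append(c);
                }
                if (pos >= src.length()) {
                    throw new IllegalArgumentException("unterminated hex string");
                }
                pos++; // consume '>'
                return hex.toString();
            }
            int start = pos;
            // A plain token also ends at '<', so "beginbfchar<0003>" yields two tokens.
            while (pos < src.length()
                    && !Character.isWhitespace(src.charAt(pos))
                    && src.charAt(pos) != '<') {
                pos++;
            }
            return src.substring(start, pos);
        }
    }

    /** Decodes a string of hex digits as UTF-16BE text. */
    static String decodeUtf16(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of hex digits");
        }
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    /** Collects bfchar mappings; on malformed input it returns what was parsed so far. */
    static Map<Integer, String> parseBfChars(String src) {
        Map<Integer, String> map = new LinkedHashMap<>();
        Tokenizer t = new Tokenizer(src);
        boolean inBlock = false;
        try {
            String tok;
            while ((tok = t.next()) != null) {
                if (tok.equals("beginbfchar")) {
                    inBlock = true;
                } else if (tok.equals("endbfchar")) {
                    inBlock = false;
                } else if (inBlock && tok.startsWith("<")) {
                    String dst = t.next();
                    if (dst == null || !dst.startsWith("<")) {
                        throw new IllegalArgumentException("missing destination");
                    }
                    map.put(Integer.parseInt(tok.substring(1), 16),
                            decodeUtf16(dst.substring(1)));
                }
            }
        } catch (IllegalArgumentException e) {
            // Badly formatted input: stop here and keep the partial mapping.
        }
        return map;
    }

    public static void main(String[] args) {
        String cmap = "1 beginbfchar<0003><00 41>\n<0004> <0042>\n<0005> <00ZZ>\nendbfchar";
        System.out.println(parseBfChars(cmap)); // prints {3=A, 4=B}
    }
}
```

      In the sample input, the first entry has no whitespace after beginbfchar and contains a space inside a hex value; the third entry is malformed, so parsing stops there and the two earlier mappings are still returned.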

      I don't see a problem with the previous test case example (BlackHat...) but I've modified the test case based on an example from the wild: http://www.itsix.com/media/experienced_java_developer.pdf

      edit: forgot to mention that this patch was designed on 1.8.3, but also worked on trunk.

      Attachments

        1. experienced_java_developer.pdf
          2.19 MB
          Andreas Lehmkühler
        2. Ignore_badly-formatted_CMap_ToUnicode_instructions.patch
          6 kB
          Andreas Lehmkühler

          People

            lehmi Andreas Lehmkühler
            cheng@indeed.com Cheng Leong
            Votes: 0
            Watchers: 2
