PDFBox
  1. PDFBox
  2. PDFBOX-139

The CMapParser does not recognize essential cmap operators

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Parsing
    • Labels:
      None

      Description

      [imported from SourceForge]
      http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1438028
      Originally submitted by vdimchev on 2006-02-24 03:48.

      The bug is directly related to the following bug I
      discovered in the database:
      [ 1208652 ] PDFTextStripper.writeText Exception:Unknown
      encoding for ..

      I'll try to exlain it again here and supply enough
      resources for its fix.

      The problem is that the current implementation of
      CMapParser class supports only the beginbfchar and
      beginbfrange operators.

      This is not enough and causes the invokation to
      PDFTextStripper.writeText() to throw IOException with
      the following message: Unknown encoding for 'Identity-
      V'.
      I also managed to produce the message: "Unknown
      encoding for '90ms-RKSJ-H'.

      The complete stacktrace is:
      java.io.IOException: Unknown encoding for 'Identity-V'
      at org.pdfbox.encoding.EncodingManager.
      getEncoding(EncodingManager.java:83)
      at org.pdfbox.pdmodel.font.PDFont.
      getEncoding(PDFont.java:627)
      at org.pdfbox.pdmodel.font.PDFont.
      encode(PDFont.java:476)
      at org.pdfbox.util.PDFStreamEngine.
      showString(PDFStreamEngine.java:332)
      at org.pdfbox.util.operator.ShowText.
      process(ShowText.java:66)
      at org.pdfbox.util.PDFStreamEngine.
      processOperator(PDFStreamEngine.java:494)
      at org.pdfbox.util.PDFStreamEngine.
      processSubStream(PDFStreamEngine.java:207)
      at org.pdfbox.util.PDFStreamEngine.
      processStream(PDFStreamEngine.java:160)
      at org.pdfbox.util.PDFTextStripper.
      processPage(PDFTextStripper.java:355)
      at org.pdfbox.util.PDFTextStripper.
      processPages(PDFTextStripper.java:268)
      at org.pdfbox.util.PDFTextStripper.
      writeText(PDFTextStripper.java:220)

      In fact the cause of this exception is that the
      CMapParser does not recognize the begincidchar and
      begincidrange operators (in the case of the 90ms-RKSJ-
      H) encoding and usecmap operator in the case of
      Identity-V encoding.

      The cmap files for these encodings are not properly
      parsed and the corresponding Cmap objects do not
      contain neither one nor two byte mappings, further the
      lookup() method returns null.

      I'll attach two samples for the 90ms-RKSJ-H encoding
      and one for the Identity-V encoding.

      I'll attach cmap reference also.

      [attachment on SourceForge]
      http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1438028&file_id=168711
      5014.CIDFont_Spec.rar (application/octet-stream), 240282 bytes
      Reference, containing CMAP description

      [attachment on SourceForge]
      http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1438028&file_id=168709
      ken1.pdf (application/pdf), 33713 bytes
      The Identity-V sample

      [attachment on SourceForge]
      http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1438028&file_id=168708
      tp0404-2a.pdf (application/pdf), 11434 bytes
      The second 90ms-RKSJ-H sample

      [attachment on SourceForge]
      http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1438028&file_id=168705
      nan_youkou.pdf (application/pdf), 7663 bytes
      The first 90ms-RKSJ-H sample

        Activity

        Anonymous created issue -
        Hide
        Jonck van der Kogel added a comment -

        Does anyone know of a work-around for the time being? This bug is really annoying.

        Show
        Jonck van der Kogel added a comment - Does anyone know of a work-around for the time being? This bug is really annoying.
        Hide
        Daniel Wilson added a comment -

        I haven't done this on THIS item, but really there are 2 options:

        1. Implement the missing items. I'm not intimately familiar with that part of the project, so can't say what that would involve.

        2. Catch the error and continue. Likely something will be skipped, but the rest of the PDF, presumably, could deliver its text.

        BTW, which version are you using? Between 0.7.3 and the trunk version, we've added a LOT of error-trapping.

        Show
        Daniel Wilson added a comment - I haven't done this on THIS item, but really there are 2 options: 1. Implement the missing items. I'm not intimately familiar with that part of the project, so can't say what that would involve. 2. Catch the error and continue. Likely something will be skipped, but the rest of the PDF, presumably, could deliver its text. BTW, which version are you using? Between 0.7.3 and the trunk version, we've added a LOT of error-trapping.
        Jukka Zitting made changes -
        Field Original Value New Value
        Priority Major [ 3 ]
        Hide
        Andreas Lehmkühler added a comment -

        The CMap-Parser was improved a long time ago and now all mentioned operators are supported.

        Set to closed.

        Show
        Andreas Lehmkühler added a comment - The CMap-Parser was improved a long time ago and now all mentioned operators are supported. Set to closed.
        Andreas Lehmkühler made changes -
        Status Open [ 1 ] Closed [ 6 ]
        Assignee Andreas Lehmkühler [ lehmi ]
        Resolution Fixed [ 1 ]

          People

          • Assignee:
            Andreas Lehmkühler
            Reporter:
            Anonymous
          • Votes:
            1 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development