Uploaded image for project: 'PDFBox'
  1. PDFBox
  2. PDFBOX-139

The CMapParser does not recognize essential cmap operators

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • None
    • Parsing
    • None

    Description

      [imported from SourceForge]
      http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1438028
      Originally submitted by vdimchev on 2006-02-24 03:48.

      The bug is directly related to the following bug I
      discovered in the database:
      [ 1208652 ] PDFTextStripper.writeText Exception:Unknown
      encoding for ..

      I'll try to exlain it again here and supply enough
      resources for its fix.

      The problem is that the current implementation of
      CMapParser class supports only the beginbfchar and
      beginbfrange operators.

      This is not enough and causes the invokation to
      PDFTextStripper.writeText() to throw IOException with
      the following message: Unknown encoding for 'Identity-
      V'.
      I also managed to produce the message: "Unknown
      encoding for '90ms-RKSJ-H'.

      The complete stacktrace is:
      java.io.IOException: Unknown encoding for 'Identity-V'
      at org.pdfbox.encoding.EncodingManager.
      getEncoding(EncodingManager.java:83)
      at org.pdfbox.pdmodel.font.PDFont.
      getEncoding(PDFont.java:627)
      at org.pdfbox.pdmodel.font.PDFont.
      encode(PDFont.java:476)
      at org.pdfbox.util.PDFStreamEngine.
      showString(PDFStreamEngine.java:332)
      at org.pdfbox.util.operator.ShowText.
      process(ShowText.java:66)
      at org.pdfbox.util.PDFStreamEngine.
      processOperator(PDFStreamEngine.java:494)
      at org.pdfbox.util.PDFStreamEngine.
      processSubStream(PDFStreamEngine.java:207)
      at org.pdfbox.util.PDFStreamEngine.
      processStream(PDFStreamEngine.java:160)
      at org.pdfbox.util.PDFTextStripper.
      processPage(PDFTextStripper.java:355)
      at org.pdfbox.util.PDFTextStripper.
      processPages(PDFTextStripper.java:268)
      at org.pdfbox.util.PDFTextStripper.
      writeText(PDFTextStripper.java:220)

      In fact the cause of this exception is that the
      CMapParser does not recognize the begincidchar and
      begincidrange operators (in the case of the 90ms-RKSJ-
      H) encoding and usecmap operator in the case of
      Identity-V encoding.

      The cmap files for these encodings are not properly
      parsed and the corresponding Cmap objects do not
      contain neither one nor two byte mappings, further the
      lookup() method returns null.

      I'll attach two samples for the 90ms-RKSJ-H encoding
      and one for the Identity-V encoding.

      I'll attach cmap reference also.

      [attachment on SourceForge]
      http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1438028&file_id=168711
      5014.CIDFont_Spec.rar (application/octet-stream), 240282 bytes
      Reference, containing CMAP description

      [attachment on SourceForge]
      http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1438028&file_id=168709
      ken1.pdf (application/pdf), 33713 bytes
      The Identity-V sample

      [attachment on SourceForge]
      http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1438028&file_id=168708
      tp0404-2a.pdf (application/pdf), 11434 bytes
      The second 90ms-RKSJ-H sample

      [attachment on SourceForge]
      http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1438028&file_id=168705
      nan_youkou.pdf (application/pdf), 7663 bytes
      The first 90ms-RKSJ-H sample

      Attachments

        Activity

          People

            lehmi Andreas Lehmkühler
            Anonymous Anonymous
            Votes:
            1 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: