Uploaded image for project: 'PDFBox'
  1. PDFBox
  2. PDFBOX-1129

Quote glyphs (quoteright, quotedblright, etc.) not mapped to the right Unicode character

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Minor
    • Resolution: Not A Problem
    • 1.7.0
    • None
    • None
    • None

    Description

      I have an example PDF (will attach) that uses a right-single-quote
      character, but extracts incorrectly from PDFBox (using ExtractText).
      If I copy/paste, the text is correct (I get U+2019 for the right
      quote).

      Search for "cashier" in the PDF, on page 1 to see it; that right quote
      is supposed to come through as U+2019 I think.

      I looked at the PDF in PDFDebugger, and I see this fragment in the
      "Contents" for page 1:

      (Bring the voucher handout to the cashier\325s office (10-180))Tj

      So somehow this \325 escape fails to map to the quoteright glyph. The
      font is partial embedded font BPOLKO+TimesNewRomanPSMT, and I can see
      in the Charset (under FontDescriptor, for font F1) that it references
      this glyph.

      I also see a [correct] entry in glyphlist.txt, mapping to U+2019, so
      that's not the problem.

      Not sure what's going wrong... maybe somehow \325 fails to map to
      quoteright?

      There are other glyphs (quotedblright, quotedblleft) that are also not
      converted correctly, eg search for project review on page 2.

      Attachments

        1. 000086.pdf
          41 kB
          Michael McCandless

        Activity

          People

            lehmi Andreas Lehmkühler
            mikemccand Michael McCandless
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: