Details
-
Bug
-
Status: Closed
-
Minor
-
Resolution: Not A Problem
-
1.7.0
-
None
-
None
-
None
Description
I have an example PDF (will attach) that uses a right-single-quote
character, but extracts incorrectly from PDFBox (using ExtractText).
If I copy/paste, the text is correct (I get U+2019 for the right
quote).
Search for "cashier" in the PDF, on page 1 to see it; that right quote
is supposed to come through as U+2019 I think.
I looked at the PDF in PDFDebugger, and I see this fragment in the
"Contents" for page 1:
(Bring the voucher handout to the cashier\325s office (10-180))Tj
So somehow this \325 escape fails to map to the quoteright glyph. The
font is partial embedded font BPOLKO+TimesNewRomanPSMT, and I can see
in the Charset (under FontDescriptor, for font F1) that it references
this glyph.
I also see a [correct] entry in glyphlist.txt, mapping to U+2019, so
that's not the problem.
Not sure what's going wrong... maybe somehow \325 fails to map to
quoteright?
There are other glyphs (quotedblright, quotedblleft) that are also not
converted correctly, eg search for project review on page 2.