Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
2.0.16, 2.0.18, 2.0.20, 2.0.25, 3.0.0 PDFBox
Description
The text output from Korean patent PDFs changed after v2.0.15 (possibly due to Unicode mapping?). This was previously addressed in PDFBOX-4661, which fixed that issue's example PDF in v2.0.18, thanks. Unfortunately, v2.0.18 now produces incorrect text output for other documents (attached here).
// PDFBox 2.0.x API; 3.0.0 replaces PDDocument.load with Loader.loadPDF
try (PDDocument doc = PDDocument.load(new File("KR1019980000128.pdf"))) {
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(doc);
}
As in PDFBOX-4661, there are numerous warnings of the form:
WARNING: No Unicode mapping for CID+14172 (14172) in font GKYPPJ+GulimChe
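To get a sense of how many glyphs are affected and in which fonts, the warnings can be tallied from a captured log. The following is only a rough sketch: `CidWarningSummary` and `countByFont` are hypothetical names, the regular expression assumes the warning keeps exactly the shape quoted above, and the second sample line (CID 14173) is made up for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: counts "No Unicode mapping" warnings per font name. */
final class CidWarningSummary {
    // Assumed format: "No Unicode mapping for CID+<n> (<n>) in font <name>"
    private static final Pattern WARNING = Pattern.compile(
            "No Unicode mapping for CID\\+(\\d+) \\((\\d+)\\) in font (\\S+)");

    /** Returns a sorted map from font name to number of matching warnings. */
    static Map<String, Integer> countByFont(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = WARNING.matcher(line);
            if (m.find()) {
                // group(3) is the font name, e.g. GKYPPJ+GulimChe
                counts.merge(m.group(3), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "WARNING: No Unicode mapping for CID+14172 (14172) in font GKYPPJ+GulimChe",
                // Invented second line, only to show the tally growing.
                "WARNING: No Unicode mapping for CID+14173 (14173) in font GKYPPJ+GulimChe");
        System.out.println(countByFont(sample));
    }
}
```

A per-font count like this may help show whether the missing mappings are confined to one embedded font subset or spread across several.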
I've attached the text dump of two versions, but in brief:
2.0.15: 공개번호 (public number)
2.0.25: 공개
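For longer dumps, the point where two versions diverge can be located mechanically. This is a minimal sketch rather than anything PDFBox provides: `TextDumpDiff` and `firstDifference` are hypothetical names, and the Unicode escapes in `main` spell the two strings quoted above (the 2.0.15 and 2.0.25 output) so that the file compiles regardless of source encoding.

```java
/** Hypothetical helper: finds where two extracted strings first differ. */
final class TextDumpDiff {
    /**
     * Returns the index (in UTF-16 chars) of the first difference,
     * or -1 if the two strings are identical.
     */
    static int firstDifference(String expected, String actual) {
        int shared = Math.min(expected.length(), actual.length());
        for (int i = 0; i < shared; i++) {
            if (expected.charAt(i) != actual.charAt(i)) {
                return i;
            }
        }
        // One string is a prefix of the other, or they are equal.
        return expected.length() == actual.length() ? -1 : shared;
    }

    public static void main(String[] args) {
        String from2015 = "\uACF5\uAC1C\uBC88\uD638"; // 2.0.15 output quoted above
        String from2025 = "\uACF5\uAC1C";             // 2.0.25 output quoted above
        int at = firstDifference(from2015, from2025);
        System.out.println("first difference at index " + at);
        if (at >= 0 && at < from2015.length()) {
            // Print the code points that the newer output lost or changed.
            from2015.substring(at).codePoints()
                    .forEach(cp -> System.out.printf("U+%04X%n", cp));
        }
    }
}
```

Listing the code points of the missing tail makes it easier to match them against the CIDs named in the warnings.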
I only confirmed the issue in the versions listed above, but I presume it affects every version from 2.0.18 onward.
If I read PDFBOX-4661 correctly, there is something unusual about these PDFs themselves. Even so, PDFBox v2.0.15 produced the correct text output. Testing in PDF.js also gives an incorrect result, 공개뮈픸, so there is clearly something non-trivial here.
Any help is much appreciated.
Attachments
Issue Links
- relates to PDFBOX-4661 Regression No Unicode mapping with Identity-H font (Closed)