[PDFBOX-5350] Regression unicode mapping in Korean document - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.16, 2.0.18, 2.0.20, 2.0.25, 3.0.0 PDFBox
Fix Version/s: 2.0.30, 3.0.1 PDFBox, 4.0.0
Component/s: Text extraction
Labels:
- regression

Description

The text output from Korean patent PDFs changed in v2.0.15+ (possibly due to Unicode mapping?). This was previously addressed in PDFBOX-4661, which fixed that example PDF in v2.0.18 - thanks. Unfortunately, v2.0.18 causes other documents (attached here) to produce incorrect text output.

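// PDFBox 2.0.x API; try-with-resources closes the document when done
try (PDDocument doc = PDDocument.load(new File("KR1019980000128.pdf"))) {
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(doc);
    System.out.println(text);
}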

As in PDFBOX-4661, there are numerous warnings of the form:

WARNING: No Unicode mapping for CID+14172 (14172) in font GKYPPJ+GulimChe
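
To get a feel for how many glyphs are affected, the warnings can be tallied per font. The following is a minimal sketch that pulls the CID and font name out of one such line. The class name `UnicodeWarningParser` is hypothetical, and the regular expression is based only on the single warning quoted above, so other PDFBox versions may word it differently.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
final class UnicodeWarningParser {
    // Assumed shape: "No Unicode mapping for <name> (<cid>) in font <font>",
    // taken from the one warning line quoted in this report.
    private static final Pattern WARNING = Pattern.compile(
            "No Unicode mapping for \\S+ \\((\\d+)\\) in font (\\S+)");

    /** CID and font name extracted from one warning line. */
    static final class MissingMapping {
        final int cid;
        final String font;

        MissingMapping(int cid, String font) {
            this.cid = cid;
            this.font = font;
        }
    }

    /** Returns the parsed warning, or empty if the line is not such a warning. */
    static Optional<MissingMapping> parse(String logLine) {
        Matcher m = WARNING.matcher(logLine);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new MissingMapping(Integer.parseInt(m.group(1)), m.group(2)));
    }
}
```

Feeding every captured log line through `parse` and grouping the results by `font` would show whether the missing mappings are confined to GulimChe or spread across fonts.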

I've attached the text dump of two versions, but in brief:

2.0.15: 공개번호 (public number)

2.0.25: 공개
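
For anyone comparing the two attached dumps, a rough sketch of locating the first line where they diverge is given below. The class name `DumpDiff` is hypothetical, and the helper works on lines already read into memory.

```java
import java.util.List;

// Hypothetical helper for comparing two text dumps line by line.
final class DumpDiff {
    /**
     * Returns the index of the first line that differs between the dumps,
     * or -1 if they are identical. If one dump is a strict prefix of the
     * other, the index of the first extra line is returned.
     */
    static int firstDifference(List<String> older, List<String> newer) {
        int shared = Math.min(older.size(), newer.size());
        for (int i = 0; i < shared; i++) {
            if (!older.get(i).equals(newer.get(i))) {
                return i;
            }
        }
        return older.size() == newer.size() ? -1 : shared;
    }
}
```

The inputs could come from `Files.readAllLines(Paths.get("KR1019980000128_2_0_15.txt"), StandardCharsets.UTF_8)` and the matching `_2_0_25.txt` file, assuming the dumps are UTF-8 encoded.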

I only confirmed the issue in the versions listed above, but presume it persists in every version >= 2.0.18.

My reading of PDFBOX-4661 is that there is something unusual about these PDFs. PDFBox v2.0.15 produced the correct text output. PDF.js gets it wrong in a different way, producing 공개뮈픸, so I can see there is something non-trivial here.
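
Since the three outputs (2.0.15, 2.0.25 and PDF.js) start with the same two syllables and only differ afterwards, it can help to print the code points of each extracted string side by side. This is a small sketch with a hypothetical class name `CodePoints`. It only formats a string that has already been extracted and does not touch PDFBox.

```java
import java.util.stream.Collectors;

// Hypothetical debugging helper: renders a string as a list of code points.
final class CodePoints {
    /** Formats each Unicode code point of the text as U+XXXX, space separated. */
    static String describe(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }
}
```

Calling `describe` on the same phrase from each version makes it easy to see which positions are dropped and which are mapped to a different syllable.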

Any help is much appreciated.

Attachments

- reports_pdfbox_2.0.29_vs_3.0.0.tar.xz, 03/Oct/23 09:07, 1.31 MB, Tilman Hausherr
- PDFBOX-5350-JX57O5E5YG6XM4FZABPULQGTW4OXPCWA-p1-reduced.pdf, 08/Oct/23 09:07, 47 kB, Tilman Hausherr
- KR1020140140600.pdf, 21/Dec/21 21:29, 227 kB, John Mayfield
- KR1019980000128.pdf, 21/Dec/21 21:29, 114 kB, John Mayfield
- KR1019980000128_2_0_25.txt, 21/Dec/21 21:40, 8 kB, John Mayfield
- KR1019980000128_2_0_15.txt, 21/Dec/21 21:40, 12 kB, John Mayfield
- KR1019900015076.pdf, 21/Dec/21 21:29, 60 kB, John Mayfield
- JX57O5E5YG6XM4FZABPULQGTW4OXPCWA.pdf_cmap_strict.txt, 04/Oct/23 10:21, 6 kB, John Mayfield
- JX57O5E5YG6XM4FZABPULQGTW4OXPCWA.pdf_cmap_non-strict.txt, 04/Oct/23 10:21, 9 kB, John Mayfield
- FQJR7LPGFZDCDX7A3FEBVQRHNHZ7XDQ2.pdf_cmap_strict.txt, 04/Oct/23 10:21, 38 kB, John Mayfield
- FQJR7LPGFZDCDX7A3FEBVQRHNHZ7XDQ2.pdf_cmap_non-strict.txt, 04/Oct/23 10:21, 60 kB, John Mayfield
- 7VL5DM7XOQC6AP4H6KL3XC7CYQ547FFQ.pdf_cmap_strict.txt, 04/Oct/23 10:21, 439 kB, John Mayfield
- 7VL5DM7XOQC6AP4H6KL3XC7CYQ547FFQ.pdf_cmap_non-strict.txt, 04/Oct/23 10:21, 739 kB, John Mayfield

Issue Links

relates to

PDFBOX-4661 Regression No Unicode mapping with Identity-H font

Closed

People

Assignee: Tilman Hausherr

Reporter: John Mayfield

Votes: 0

Watchers: 4

Dates

Created: 21/Dec/21 21:51

Updated: 05/Nov/23 11:33

Resolved: 08/Oct/23 12:23