PDFBOX-5350: Regression in Unicode mapping in Korean document

Details

    Description

      The text output from Korean patent PDFs changed in v2.0.15+ (due to Unicode mapping?). This was previously addressed in PDFBOX-4661, which resolved that example PDF in v2.0.18 (thanks). Unfortunately, v2.0.18 causes other documents (attached here) to produce incorrect text output.

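      // PDFBox 2.0.x API (org.apache.pdfbox.pdmodel.PDDocument, org.apache.pdfbox.text.PDFTextStripper)
      try (PDDocument doc = PDDocument.load(new File("KR1019980000128.pdf"))) {
          PDFTextStripper stripper = new PDFTextStripper();
          String text = stripper.getText(doc);
          System.out.println(text);
      }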

      As in PDFBOX-4661, there are numerous warnings of the form:

      WARNING: No Unicode mapping for CID+14172 (14172) in font GKYPPJ+GulimChe
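
      To get a feel for how many glyphs are affected per font, the warnings can be tallied once they have been captured as log lines. The following is only a minimal sketch: the class and method names are hypothetical (not part of PDFBox), and it assumes every warning follows exactly the message format quoted above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of PDFBox): counts "No Unicode mapping" warnings per font.
class UnmappedCidTally {
    // Assumes the message format quoted in this report:
    // "WARNING: No Unicode mapping for CID+14172 (14172) in font GKYPPJ+GulimChe"
    private static final Pattern WARNING =
            Pattern.compile("No Unicode mapping for CID\\+(\\d+) \\((\\d+)\\) in font (\\S+)");

    // Returns font name -> number of matching warning lines; non-matching lines are ignored.
    static Map<String, Integer> countByFont(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = WARNING.matcher(line);
            if (m.find()) {
                counts.merge(m.group(3), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "WARNING: No Unicode mapping for CID+14172 (14172) in font GKYPPJ+GulimChe",
                "INFO: some unrelated line");
        System.out.println(countByFont(lines));
    }
}
```

      Grouping by font name should make it easier to see whether the missing mappings are confined to a single embedded font subset.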

      I've attached text dumps from the two versions, but in brief:

      2.0.15: 공개번호 (public number)

      2.0.25: 공개 
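
      One way to locate where two extracted strings start to diverge is to count the leading code points they share. The sketch below is illustrative only: the helper is hypothetical, and the sample strings are the 2.0.15 output and the leading part of the 2.0.25 output quoted above, written as Unicode escapes so the source stays ASCII.

```java
// Hypothetical helper: reports how many leading code points two extracted strings share.
class TextDumpCompare {
    // Compares by Unicode code point rather than by UTF-16 char,
    // so supplementary characters count as a single unit.
    static int commonPrefixCodePoints(String a, String b) {
        int[] ca = a.codePoints().toArray();
        int[] cb = b.codePoints().toArray();
        int n = Math.min(ca.length, cb.length);
        int i = 0;
        while (i < n && ca[i] == cb[i]) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        String v2015 = "\uACF5\uAC1C\uBC88\uD638"; // the four syllables quoted for 2.0.15
        String v2025 = "\uACF5\uAC1C";             // the leading syllables quoted for 2.0.25
        System.out.println("shared leading code points: "
                + commonPrefixCodePoints(v2015, v2025));
    }
}
```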

      I have only confirmed the issue in the versions listed above, but presume it affects all versions >= 2.0.18.

      My reading of PDFBOX-4661 is that there is something funky about these PDFs. PDFBox v2.0.15 produced the correct text output. Testing in PDF.js incorrectly produces 공개뮈픸, so I can see there is something non-trivial here.
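
      When comparing outputs from different tools (for example the PDFBox and PDF.js strings above), it can help to look at the code points rather than the rendered glyphs. A minimal sketch, with a hypothetical helper name and using only the Java standard library:

```java
import java.util.stream.Collectors;

// Hypothetical helper: renders a string as a space-separated list of U+XXXX code points.
class CodePointDump {
    static String toCodePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Pass the extracted text as the first argument, or fall back to a two-syllable sample.
        String text = args.length > 0 ? args[0] : "\uACF5\uAC1C";
        System.out.println(toCodePoints(text));
    }
}
```

      Dumping the strings produced by each tool this way shows which code points each one actually emitted for the same glyphs.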

      Any help is much appreciated.

      Attachments

        1. KR1019900015076.pdf
          60 kB
          John Mayfield
        2. KR1019980000128.pdf
          114 kB
          John Mayfield
        3. KR1020140140600.pdf
          227 kB
          John Mayfield
        4. KR1019980000128_2_0_25.txt
          8 kB
          John Mayfield
        5. KR1019980000128_2_0_15.txt
          12 kB
          John Mayfield
        6. reports_pdfbox_2.0.29_vs_3.0.0.tar.xz
          1.31 MB
          Tilman Hausherr
        7. JX57O5E5YG6XM4FZABPULQGTW4OXPCWA.pdf_cmap_strict.txt
          6 kB
          John Mayfield
        8. JX57O5E5YG6XM4FZABPULQGTW4OXPCWA.pdf_cmap_non-strict.txt
          9 kB
          John Mayfield
        9. FQJR7LPGFZDCDX7A3FEBVQRHNHZ7XDQ2.pdf_cmap_strict.txt
          38 kB
          John Mayfield
        10. FQJR7LPGFZDCDX7A3FEBVQRHNHZ7XDQ2.pdf_cmap_non-strict.txt
          60 kB
          John Mayfield
        11. 7VL5DM7XOQC6AP4H6KL3XC7CYQ547FFQ.pdf_cmap_strict.txt
          439 kB
          John Mayfield
        12. 7VL5DM7XOQC6AP4H6KL3XC7CYQ547FFQ.pdf_cmap_non-strict.txt
          739 kB
          John Mayfield
        13. PDFBOX-5350-JX57O5E5YG6XM4FZABPULQGTW4OXPCWA-p1-reduced.pdf
          47 kB
          Tilman Hausherr

            People

              tilman Tilman Hausherr
              jwmayfield John Mayfield