Details
Bug
Status: Closed
Major
Resolution: Fixed
2.0.19
None
Description
I missed a rendering change (sorry) for the file in the linked PDF.js issue. The change happened in PDFBOX-4810, but it is not a regression; rather, bad input is displayed differently because different data is read.
The CMap has these ranges:
4 begincodespacerange
<00><7f>
<c080><dfbf>
<e08080><efbfbf>
<f0808080><f7bfbfbf>
endcodespacerange
The content stream has segments like
(Check\340up Date:2020/ 3/ 4 11:46) Tj
\340 is octal for 0xE0. The current code in CMap.readCode() reads bytes until a codespace range fits, which means it reads 4 bytes before it notices that no range matches. After the failure it doesn't reposition. So this is displayed as "Check ·Date" instead of "Check -up Date", i.e. input is lost. The "·" is the default glyph.
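To illustrate why no range fits, here is a minimal sketch that checks the bytes 0xE0 'u' 'p' ' ' against the four codespace ranges above, byte by byte, for lengths 1 to 4. The class and method names are hypothetical and are not taken from PDFBox.

```java
class CodespaceMatchSketch {
    // Codespace ranges from the CMap in this issue, as {low, high} byte sequences.
    static final int[][][] RANGES = {
        {{0x00}, {0x7f}},
        {{0xc0, 0x80}, {0xdf, 0xbf}},
        {{0xe0, 0x80, 0x80}, {0xef, 0xbf, 0xbf}},
        {{0xf0, 0x80, 0x80, 0x80}, {0xf7, 0xbf, 0xbf, 0xbf}},
    };

    // True if the first len bytes of code fall inside some range of exactly that length.
    static boolean matches(int[] code, int len) {
        for (int[][] range : RANGES) {
            if (range[0].length != len) {
                continue;
            }
            boolean inRange = true;
            for (int i = 0; i < len; i++) {
                if (code[i] < range[0][i] || code[i] > range[1][i]) {
                    inRange = false;
                    break;
                }
            }
            if (inRange) {
                return true;
            }
        }
        return false;
    }

    // Length of the shortest prefix that matches a range, or -1 if none does within 4 bytes.
    static int matchLength(int[] bytes) {
        for (int len = 1; len <= Math.min(4, bytes.length); len++) {
            if (matches(bytes, len)) {
                return len;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(matchLength(new int[] {0xe0, 'u', 'p', ' '})); // prints -1
        System.out.println(matchLength(new int[] {'C'}));                 // prints 1
    }
}
```

0xE0 only fits the first byte of the 3-byte range, but 'u' (0x75) is outside 0x80..0xBF, so all four attempts fail and four bytes have been consumed by then.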
The solution is to remember the position before reading and to reposition there after a failure. I'm using mark() and reset() which, surprisingly, works both when loading in memory and when loading with a temp file.
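The mark()/reset() idea could look roughly like the following sketch. It is not the committed PDFBox code: the class, the readCode() and decode() helpers, and the choice to fall back to a single byte after a failed match are assumptions made for illustration. It also assumes the stream supports marking (InputStream.markSupported() returns true), as ByteArrayInputStream does.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class MarkResetSketch {
    static final int MAX_LEN = 4;
    // Codespace ranges from the CMap in this issue, as {low, high} byte sequences.
    static final int[][][] RANGES = {
        {{0x00}, {0x7f}},
        {{0xc0, 0x80}, {0xdf, 0xbf}},
        {{0xe0, 0x80, 0x80}, {0xef, 0xbf, 0xbf}},
        {{0xf0, 0x80, 0x80, 0x80}, {0xf7, 0xbf, 0xbf, 0xbf}},
    };

    static boolean matches(int[] code, int len) {
        for (int[][] range : RANGES) {
            if (range[0].length != len) {
                continue;
            }
            boolean inRange = true;
            for (int i = 0; i < len; i++) {
                inRange &= code[i] >= range[0][i] && code[i] <= range[1][i];
            }
            if (inRange) {
                return true;
            }
        }
        return false;
    }

    // Reads one code; returns its bytes, or an empty array at end of stream.
    static int[] readCode(InputStream in) throws IOException {
        in.mark(MAX_LEN); // remember the position before reading anything
        int[] buf = new int[MAX_LEN];
        for (int len = 1; len <= MAX_LEN; len++) {
            int b = in.read();
            if (b == -1) {
                break;
            }
            buf[len - 1] = b;
            if (matches(buf, len)) {
                return Arrays.copyOf(buf, len);
            }
        }
        // No range matched: reposition and consume only the first byte
        // (assumption of this sketch), so the following bytes are not lost.
        in.reset();
        int first = in.read();
        return first == -1 ? new int[0] : new int[] {first};
    }

    // Shows single-byte ASCII codes as characters and every other code as '?'.
    static String decode(byte[] data) {
        StringBuilder sb = new StringBuilder();
        try (InputStream in = new ByteArrayInputStream(data)) {
            int[] code;
            while ((code = readCode(in)).length > 0) {
                sb.append(code.length == 1 && code[0] < 0x80 ? (char) code[0] : '?');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = "Check\u00e0up Date".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(data)); // prints Check?up Date
    }
}
```

Here '?' stands in for the default glyph; the characters after the bad byte survive because the stream is rewound before the fallback read.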