[PDFBOX-4720] cmap entries "<0000> <FFFF> <0000>" are cut - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.17
Fix Version/s: 2.0.19, 3.0.0 PDFBox
Component/s: FontBox, Text extraction
Labels:
- regression

Description

Part of the extracted text is lost in the attached files because of the ToUnicode stream. Entries like

<0000> <FFFF> <0000>

are cut at 256 elements, likely as a result of PDFBOX-4661 and earlier changes. While such an entry is incorrect, I think it should still be accepted when it is exactly this entry.
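To make the effect concrete, here is a minimal sketch of how expanding such a range with and without the 256-element cut changes the resulting mapping. It is not the FontBox code: the class `BfRangeSketch`, the method `expand`, and the flag `cutAtLowByteOverflow` are hypothetical names, and the sketch assumes two-byte source codes and destinations within the Basic Multilingual Plane.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; names and structure are assumptions,
// not taken from the PDFBox/FontBox sources.
class BfRangeSketch {

    /**
     * Expands a two-byte range entry "<start> <end> <dst>" into
     * code -> text mappings.
     *
     * When cutAtLowByteOverflow is true, expansion stops once the low byte
     * of the source code reaches 0xFF, so at most 256 entries are produced.
     * When it is false, the whole range is accepted.
     */
    static Map<Integer, String> expand(int start, int end, int dst,
                                       boolean cutAtLowByteOverflow) {
        Map<Integer, String> map = new LinkedHashMap<>();
        int last = end;
        if (cutAtLowByteOverflow) {
            // Highest code that shares the high byte with 'start'.
            last = Math.min(end, start | 0xFF);
        }
        for (int code = start; code <= last; code++) {
            // Assumption: the destination stays within the BMP (one UTF-16 unit).
            char target = (char) (dst + (code - start));
            map.put(code, String.valueOf(target));
        }
        return map;
    }

    public static void main(String[] args) {
        Map<Integer, String> cut = expand(0x0000, 0xFFFF, 0x0000, true);
        Map<Integer, String> full = expand(0x0000, 0xFFFF, 0x0000, false);
        System.out.println("cut size:  " + cut.size());
        System.out.println("full size: " + full.size());
        // Code 0x0041 is mapped either way; code 0x20AC only in the full map.
        System.out.println("cut has 0x20AC:  " + cut.containsKey(0x20AC));
        System.out.println("full has 0x20AC: " + full.containsKey(0x20AC));
    }
}
```

With the cut applied, only codes 0x0000 to 0x00FF receive a mapping, so any code above 0x00FF is left without a Unicode value under this sketch's assumptions.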

Several such files have popped up in the last regression tests. I analysed only this one, but it explains why I saw so many "foreign" differences: the ASCII codes are OK, but the "very special" characters are not.

Attachments

- OLZD4CBMION7ZUV3M62BYWWIQJIS3IA3.pdf, 422 kB, 20/Dec/19 19:49, Tilman Hausherr
- OLZD4CBMION7ZUV3M62BYWWIQJIS3IA3-reduced.pdf, 180 kB, 20/Dec/19 19:49, Tilman Hausherr

People

Assignee: Andreas Lehmkühler
Reporter: Tilman Hausherr
Votes: 0
Watchers: 3

Dates

Created: 20/Dec/19 19:52
Updated: 30/Jun/20 06:13
Resolved: 25/Dec/19 11:17