[PDFBOX-449] Decomposed extended Latin Characters not normalized - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.8.0-incubator
Component/s: Text extraction
Labels: None

Description

The file 03_2_SSL.pdf contains the Unicode character U+00A8 (DIAERESIS), a spacing diacritic. When the text is extracted, this character is not placed over the previous character; the combining form U+0308 (COMBINING DIAERESIS) is required for that. The issue applies to most diacritics found in PDF files, since PDF writers use the non-combining forms of the diacritics instead of the combining forms. It also applies to two files in the regression test: Garcia2004_thesis.pdf and cweb.pdf.
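
To illustrate the kind of normalization described above, here is a minimal Java sketch. It is not the fix that was committed for this issue. The class name SpacingDiacriticFixer, the method fix, and the small mapping table are all hypothetical, chosen only for illustration. It assumes the spacing diacritic directly follows its base letter in the extracted text, replaces it with the combining form, and then composes the pair with java.text.Normalizer (NFC).

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox: an illustrative sketch only.
final class SpacingDiacriticFixer {

    private SpacingDiacriticFixer() {
    }

    // Maps a spacing diacritic to its combining counterpart.
    // The table is a small illustrative subset, not a complete list.
    private static char toCombining(char c) {
        switch (c) {
            case '\u00A8': return '\u0308'; // DIAERESIS to COMBINING DIAERESIS
            case '\u00B4': return '\u0301'; // ACUTE ACCENT to COMBINING ACUTE ACCENT
            case '\u02C6': return '\u0302'; // MODIFIER CIRCUMFLEX to COMBINING CIRCUMFLEX
            case '\u02DC': return '\u0303'; // SMALL TILDE to COMBINING TILDE
            case '\u00B8': return '\u0327'; // CEDILLA to COMBINING CEDILLA
            default: return c;              // anything else is left unchanged
        }
    }

    // Replaces a spacing diacritic that directly follows a letter with its
    // combining form, then applies NFC so base letter and mark are composed.
    // Assumption: the diacritic appears after its base character.
    static String fix(String extracted) {
        StringBuilder sb = new StringBuilder(extracted.length());
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            char combining = toCombining(c);
            boolean followsLetter = i > 0 && Character.isLetter(extracted.charAt(i - 1));
            sb.append(combining != c && followsLetter ? combining : c);
        }
        // NFC is applied to the whole string, so other decomposed
        // sequences in the input are composed as well.
        return Normalizer.normalize(sb.toString(), Normalizer.Form.NFC);
    }
}
```

With this sketch, an input of "u" followed by U+00A8 becomes the single character U+00FC, while a diaeresis that does not follow a letter is left as it is. A real extractor would probably also need positional information from the PDF to decide which character the diacritic belongs to; this sketch relies on character order alone.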