[PDFBOX-449] Decomposed extended Latin Characters not normalized - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.8.0-incubator
Component/s: Text extraction
Labels: None

Description

The file 03_2_SSL.pdf contains the Unicode character U+00A8 (DIAERESIS, a spacing character). When the text is extracted, this character is not placed over the preceding character; the combining form U+0308 (COMBINING DIAERESIS) is required for that. The issue applies to most diacritics found in PDF files, because PDF writers emit the non-combining (spacing) forms of the diacritics instead of the combining forms. It also affects two files in the regression test set: Garcia2004_thesis.pdf and cweb.pdf.
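
The substitution described above can be sketched in Java as follows. This is only an illustration and not the code in the attached patch: the class name `DiacriticMerge`, the method `mergeDiacritics`, and the three-entry mapping table are hypothetical, and the sketch assumes the spacing diacritic appears directly after its base letter in the extracted string. It replaces the spacing form with its combining counterpart and then lets `java.text.Normalizer` compose the pair.

```java
import java.text.Normalizer;
import java.util.Map;

final class DiacriticMerge {
    // Hypothetical, deliberately small table: spacing diacritic -> combining form.
    private static final Map<Character, Character> SPACING_TO_COMBINING = Map.of(
            '\u00A8', '\u0308',  // DIAERESIS    -> COMBINING DIAERESIS
            '\u00B4', '\u0301',  // ACUTE ACCENT -> COMBINING ACUTE ACCENT
            '\u00B8', '\u0327'); // CEDILLA      -> COMBINING CEDILLA

    // Assumes the diacritic immediately follows the letter it belongs to.
    static String mergeDiacritics(String extracted) {
        StringBuilder out = new StringBuilder(extracted.length());
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            Character combining = SPACING_TO_COMBINING.get(c);
            boolean followsLetter = out.length() > 0
                    && Character.isLetter(out.charAt(out.length() - 1));
            if (combining != null && followsLetter) {
                // Attach the mark to the preceding base letter.
                out.append(combining.charValue());
            } else {
                // No base letter to attach to: keep the character unchanged.
                out.append(c);
            }
        }
        // NFC composes base letter + combining mark into one code point where one exists.
        return Normalizer.normalize(out.toString(), Normalizer.Form.NFC);
    }
}
```

With this sketch, an extracted "u" followed by U+00A8 becomes the single character U+00FC, while a U+00A8 that has no preceding letter is left as it is.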

Attachments

Diacritic_Merge_fix_1.4Comp.diff (17 kB), attached by Justin LeFebvre, 03/Apr/09 17:03

Issue Links

is related to: PDFBOX-1622 TextNormalize init not thread-safe, may lead to infinite loop (Closed)

People

Assignee: Unassigned

Reporter: Justin LeFebvre

Votes: 0

Watchers: 0

Dates

Created: 01/Apr/09 20:01

Updated: 04/Jun/13 17:15

Resolved: 07/Apr/09 16:25