[PDFBOX-5600] applyGsubFeature() doesn't use the longest possible replacement - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 3.0.0 PDFBox
Fix Version/s: 3.0.0 PDFBox
Component/s: FontBox
Labels:
- gsub

Description

While working on Latin ligatures I noticed that in words like "affluent" only "ff" was caught, but not "ffl".

CompoundCharacterTokenizer calls getRegexFromTokens which returns Strings like

(_79_99_)|(_80_99_)|(_92_99_)

and makes a regexp out of that.

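tokenize finds its match with find(), but not necessarily the longest one: regexp alternation takes the first alternative that matches, in the order the alternatives are listed.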
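To illustrate the behaviour described above, here is a minimal sketch using only java.util.regex. The glyph IDs (10 for "f", 16 for "l") and the class and method names are made up for the demonstration and are not taken from FontBox; only the underscore-delimited token shape follows the example in this report.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationOrderDemo {

    // Returns the text of the first match found by find(), or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Hypothetical glyph sequence containing "f f l" as 10, 10, 16.
        String input = "_5_10_10_16_25_";

        // Shorter alternative listed first: find() stops at "ff".
        System.out.println(firstMatch("(_10_10_)|(_10_10_16_)", input));

        // Longer alternative listed first: find() returns "ffl".
        System.out.println(firstMatch("(_10_10_16_)|(_10_10_)", input));
    }
}
```

Both patterns start matching at the same position; only the order of the alternatives decides which one wins.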

Thus the set used by CompoundCharacterTokenizer should be sorted by descending length before getRegexFromTokens builds the regexp from it. I'm solving this with a custom TreeSet in getMatchersAsStrings.
This will of course make everything slower. In the long run, maybe we should rewrite the code so that it doesn't use the regexp logic (although it's a smart idea!), but only after we have more real-world test coverage.
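The idea of a length-sorted TreeSet can be sketched roughly as follows. This is not the committed patch: the class, the method names and the glyph IDs are hypothetical, and the comparator falls back to natural string order so that tokens of equal length are not treated as duplicates and dropped by the set.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class LongestFirstRegex {

    // Longest strings first; ties are broken by natural order so that
    // distinct tokens of the same length are all kept in the TreeSet.
    static final Comparator<String> LONGEST_FIRST = (a, b) -> {
        int byLength = Integer.compare(b.length(), a.length());
        return byLength != 0 ? byLength : a.compareTo(b);
    };

    // Copies the tokens into a set that iterates from longest to shortest.
    static Set<String> sortLongestFirst(Collection<String> tokens) {
        Set<String> sorted = new TreeSet<>(LONGEST_FIRST);
        sorted.addAll(tokens);
        return sorted;
    }

    // Joins the tokens into an alternation such as (a)|(b)|(c), in iteration order.
    static String toRegex(Collection<String> tokens) {
        return tokens.stream()
                .map(t -> "(" + t + ")")
                .collect(Collectors.joining("|"));
    }

    public static void main(String[] args) {
        // Hypothetical tokens: "ff", "ffl" and an unrelated pair.
        Set<String> sorted = sortLongestFirst(
                Arrays.asList("_10_10_", "_10_10_16_", "_10_13_"));
        System.out.println(toRegex(sorted));
    }
}
```

With the alternatives emitted in this order, a find() on the resulting pattern tries the "ffl" token before the "ff" token at each position.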

People

Assignee: Tilman Hausherr

Reporter: Tilman Hausherr

Votes: 0

Watchers: 2

Dates

Created: 11/May/23 18:13

Updated: 18/Aug/23 05:46

Resolved: 11/May/23 18:23