Details
Type: Sub-task
Status: Closed
Priority: Major
Resolution: Fixed
Version: 3.0.0 PDFBox
Description
While working on Latin ligatures, I noticed that in words like "affluent" only the "ff" ligature was caught, not "ffl".
CompoundCharacterTokenizer calls getRegexFromTokens, which returns strings like
(_79_99_)|(_80_99_)|(_92_99_)
and makes a regexp out of that.
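tokenize finds its match with find(), but not necessarily the longest one: in a regexp alternation, the first alternative that matches at a given position wins, even if a later alternative would match more text.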
Thus the set used by CompoundCharacterTokenizer should be sorted by descending length before getRegexFromTokens builds the regexp, so that longer tokens are tried first. I'm solving this with a custom TreeSet in getMatchersAsStrings.
This will of course make everything slower; in the long run, maybe we should rewrite the code so that it doesn't use the regexp logic (although it's a smart idea!), but only after we have more real-world test coverage.
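For illustration, here is a minimal standalone sketch of the longest-match problem and of the sorting fix. It is not the PDFBox code: the class LongestFirstRegexSketch and its helpers buildPattern and firstMatch are hypothetical, and plain letters stand in for the glyph-id strings that the real tokenizer works on.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.Comparator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are assumptions, not PDFBox API.
class LongestFirstRegexSketch {

    // Longer tokens sort first. Ties are broken lexicographically so that
    // distinct tokens of equal length are not treated as duplicates by TreeSet.
    static final Comparator<String> LONGEST_FIRST = (a, b) -> {
        int byLength = Integer.compare(b.length(), a.length());
        return byLength != 0 ? byLength : a.compareTo(b);
    };

    // Builds "(t1)|(t2)|..." in the iteration order of the given tokens.
    static Pattern buildPattern(Iterable<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append('(').append(Pattern.quote(token)).append(')');
        }
        return Pattern.compile(sb.toString());
    }

    // Returns the first match reported by find(), or null if there is none.
    static String firstMatch(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Unsorted: "ff" is listed before "ffl", so it wins at the same position.
        Pattern naive = buildPattern(List.of("ff", "ffl"));
        System.out.println(firstMatch(naive, "affluent"));   // prints "ff"

        // Sorted by descending length: "ffl" is tried first.
        Set<String> sorted = new TreeSet<>(LONGEST_FIRST);
        sorted.addAll(List.of("ff", "ffl"));
        Pattern fixed = buildPattern(sorted);
        System.out.println(firstMatch(fixed, "affluent"));   // prints "ffl"
    }
}
```

The tie-break in the comparator matters: a TreeSet drops elements that its comparator reports as equal, so comparing by length alone would silently discard all but one token of each length.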