PDFBox / PDFBOX-5599 Support latin ligatures / PDFBOX-5600

applyGsubFeature() doesn't use the longest possible replacement

Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0 PDFBox
    • Fix Version/s: 3.0.0 PDFBox
    • Component/s: FontBox

    Description

      While working on Latin ligatures I noticed that in words like "affluent" only the "ff" ligature was applied, not the longer "ffl".

      CompoundCharacterTokenizer calls getRegexFromTokens which returns Strings like

      (_79_99_)|(_80_99_)|(_92_99_)
      

      and makes a regexp out of that.

      tokenize finds its match with find(), but that match is not necessarily the longest one: the regexp engine takes the first alternative that matches at a given position.
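
To illustrate the behaviour described above, here is a minimal standalone sketch using only java.util.regex. It is not PDFBox code: the class name, the helper firstMatch and the glyph ids (68, 73, 79) are made up for the example, and only the "_id_id_" token shape is taken from the regexp shown earlier.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of PDFBox.
class AlternationOrderDemo
{
    // Returns the first substring reported by find(), or null if nothing matches.
    static String firstMatch(String regex, String text)
    {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args)
    {
        // Assumed glyph ids for illustration only: 68 = a, 73 = f, 79 = l
        String glyphs = "_68_73_73_79_";

        // Short alternative listed first: find() stops at the "ff" sequence.
        System.out.println(firstMatch("(_73_73_)|(_73_73_79_)", glyphs));

        // Long alternative listed first: find() returns the "ffl" sequence.
        System.out.println(firstMatch("(_73_73_79_)|(_73_73_)", glyphs));
    }
}
```

Both patterns contain the same alternatives; only their order differs, which is why the order of the token set matters.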

      Thus the set of tokens used by CompoundCharacterTokenizer should be sorted by descending length before getRegexFromTokens builds the regexp, so that longer sequences come first in the alternation. I'm solving this with a custom TreeSet in getMatchersAsStrings.
      This will of course make everything slower; in the long run, maybe we should rewrite the code so that it doesn't use the regexp logic (although it's a smart idea!), but only after we have more real-world test coverage.
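
A rough sketch of the longest-first ordering idea, assuming the tokens are plain "_id_id_" strings. This is not the committed PDFBox change: the class and method names (LongestFirstTokens, sortLongestFirst, toRegex) are hypothetical. One detail matters when a TreeSet is used: the comparator needs a tie-breaker, because a TreeSet treats elements that compare as equal as duplicates and would silently drop tokens of the same length.

```java
import java.util.Collection;
import java.util.Comparator;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical helper, not part of PDFBox.
class LongestFirstTokens
{
    // Orders tokens by descending length; equal lengths fall back to
    // natural String order so that no token is discarded as a duplicate.
    static Set<String> sortLongestFirst(Collection<String> tokens)
    {
        Comparator<String> longestFirst = (a, b) ->
        {
            int byLength = Integer.compare(b.length(), a.length());
            return byLength != 0 ? byLength : a.compareTo(b);
        };
        Set<String> sorted = new TreeSet<>(longestFirst);
        sorted.addAll(tokens);
        return sorted;
    }

    // Builds an alternation such as (_73_73_79_)|(_73_73_) with the longest
    // token first. The tokens are assumed to contain only digits and
    // underscores, so no regexp quoting is applied here.
    static String toRegex(Collection<String> tokens)
    {
        return sortLongestFirst(tokens).stream()
                .map(token -> "(" + token + ")")
                .collect(Collectors.joining("|"));
    }
}
```

With the assumed tokens "_73_73_", "_73_79_" and "_73_73_79_", toRegex would put "_73_73_79_" first, so find() prefers the three-glyph sequence wherever it occurs.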

          People

            Assignee: Tilman Hausherr
            Reporter: Tilman Hausherr