  1. PDFBox
  2. PDFBOX-5875

Using font data to process ligatures

Details

    Description

      Process ligatures in Asian-language text (where a single glyph represents a combination of two or more Unicode characters) using the data in embedded fonts.

       

      The problem:

      Currently, modern PDF creators put these ligatures in the /ActualText field, which we only recently considered supporting in this issue. But this is not the case in old PDFs with embedded CID fonts, such as page.pdf, where the ligature glyphs lack a /ToUnicode character mapping: there is no single Unicode code point for them, because each one is a combination of more than one Unicode character.

       

      The Potential Solution (if not perfect): 

      I managed to extract the font files using PDFBox (code), and when I viewed the font files in FontForge I found the ligature data intact in them. So we can use this data to map each ligature glyph to the Unicode characters of its constituent glyphs.
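
The mapping idea above can be sketched in plain Java. This is only an illustration, not PDFBox code: the class name `LigatureUnicodeSketch`, both maps, and the glyph IDs are hypothetical. It assumes the ligature table (ligature glyph ID to component glyph IDs, which OpenType fonts store as GSUB ligature substitutions) and the glyph-to-Unicode map have already been read from the embedded font; how they are read is outside this sketch.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of PDFBox.
final class LigatureUnicodeSketch {

    /**
     * Returns the Unicode text for a glyph ID.
     * A direct mapping wins; otherwise the glyph is treated as a ligature and
     * the mappings of its component glyphs are concatenated in order.
     * Empty if the glyph is unknown or any component has no mapping.
     * Nested ligatures (components that are themselves ligatures) are not handled.
     */
    static Optional<String> toUnicode(int glyphId,
                                      Map<Integer, String> glyphToUnicode,
                                      Map<Integer, List<Integer>> ligatureComponents) {
        String direct = glyphToUnicode.get(glyphId);
        if (direct != null) {
            return Optional.of(direct);
        }
        List<Integer> components = ligatureComponents.get(glyphId);
        if (components == null) {
            return Optional.empty();
        }
        StringBuilder text = new StringBuilder();
        for (int componentId : components) {
            String part = glyphToUnicode.get(componentId);
            if (part == null) {
                return Optional.empty(); // a component glyph has no Unicode mapping
            }
            text.append(part);
        }
        return Optional.of(text.toString());
    }

    public static void main(String[] args) {
        // Assumed sample data: glyph 10 = TAMIL LETTER KA, glyph 50 = TAMIL VOWEL SIGN I,
        // glyph 200 = a ligature built from glyphs 10 and 50.
        Map<Integer, String> glyphToUnicode = Map.of(10, "\u0B95", 50, "\u0BBF");
        Map<Integer, List<Integer>> ligatures = Map.of(200, List.of(10, 50));
        System.out.println(toUnicode(200, glyphToUnicode, ligatures).orElse("<unmapped>"));
    }
}
```

Returning an empty `Optional` rather than a partial string keeps the strict case explicit, so a caller can decide separately what to do when a component is unmapped.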

       

      Problems:

      In some cases the constituent glyphs may not be present in the cmap at all: a PDF optimiser may have removed them because they are never used directly in the PDF, only as parts of ligatures. Such glyphs are empty, with only a glyph ID and no /ToUnicode mapping, even if the glyph has a corresponding Unicode character.
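
One possible way to cope with stripped component glyphs is to resolve what can be resolved and mark the rest. The sketch below is an assumption about how that might look, not an existing PDFBox feature: `PartialLigatureSketch`, its `Result` type, and the sample glyph IDs are hypothetical. Unmapped components are replaced by U+FFFD (the Unicode replacement character) and their glyph IDs are reported, so a later step such as a spell checker can try to repair the text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of PDFBox.
final class PartialLigatureSketch {

    static final String REPLACEMENT = "\uFFFD"; // Unicode replacement character

    /** Resolved text plus the component glyph IDs that had no Unicode mapping. */
    static final class Result {
        final String text;
        final List<Integer> missingGlyphIds;

        Result(String text, List<Integer> missingGlyphIds) {
            this.text = text;
            this.missingGlyphIds = missingGlyphIds;
        }
    }

    /**
     * Concatenates the Unicode mappings of a ligature's component glyphs.
     * A component without a mapping contributes U+FFFD and is recorded as missing.
     */
    static Result resolveLenient(List<Integer> componentGlyphIds,
                                 Map<Integer, String> glyphToUnicode) {
        StringBuilder text = new StringBuilder();
        List<Integer> missing = new ArrayList<>();
        for (int glyphId : componentGlyphIds) {
            String part = glyphToUnicode.get(glyphId);
            if (part == null) {
                missing.add(glyphId);
                text.append(REPLACEMENT);
            } else {
                text.append(part);
            }
        }
        return new Result(text.toString(), List.copyOf(missing));
    }

    public static void main(String[] args) {
        // Assumed sample data: glyph 10 is mapped, glyph 50 was stripped of its mapping.
        Result r = resolveLenient(List.of(10, 50), Map.of(10, "\u0B95"));
        System.out.println(r.text + " missing=" + r.missingGlyphIds);
    }
}
```

Keeping the list of missing glyph IDs alongside the text makes it possible to log or count how often the problem actually occurs in a given PDF.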

       

      The Hope:

      This is not a common problem in large PDFs, and basic spell checkers could easily rectify it. Some comprehension is better than none when it comes to dealing with data. This would greatly enhance the parsing of non-Latin Asian languages.

       

      (The PDF sample I attached is in Tamil.)

          People

            Unassigned
            manish003 Manish S N
