  1. FOP
  2. FOP-2701

Some Latin ligatures make text unsearchable in PDF

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.1
    • Fix Version/s: None
    • Component/s: font/opentype
    • Labels: None
    • Environment: Windows 10, Calibri font.
    • Important

    Description

      This problem occurs with the Calibri font, which ships with the MS Office suite and Windows 10.

      I tested with the following text: file settings.
      The resulting PDF text contains ligatures: (fi)le se(tti)ngs

      Searching for file in Acrobat Reader selects the first word, which is OK. But searching for set or settings gives no results.

      The same example works fine when run with Antenna House: searching for settings returns results.
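
      For context: PDF viewers generally recover searchable text from an embedded font through its ToUnicode CMap, which maps each character code used in the page content to one or more Unicode characters. A ligature glyph stays searchable when its entry maps to the full character sequence. The following is a minimal sketch of such an entry, not FOP code. The glyph code 0x01A3 and the helper name bfcharEntry are made up for illustration, and whether a missing or wrong entry is the actual cause here is not confirmed.

```java
import java.nio.charset.StandardCharsets;

public class ToUnicodeEntry {

    /**
     * Formats one "bfchar" line of a ToUnicode CMap:
     * <character code> <UTF-16BE hex of the text the glyph stands for>.
     */
    static String bfcharEntry(int code, String text) {
        StringBuilder sb = new StringBuilder();
        sb.append(String.format("<%04X> <", code));
        for (byte b : text.getBytes(StandardCharsets.UTF_16BE)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.append('>').toString();
    }

    public static void main(String[] args) {
        // 0x01A3 is a hypothetical code for the (tti) ligature glyph.
        System.out.println(bfcharEntry(0x01A3, "tti")); // <01A3> <007400740069>
    }
}
```

      With an entry like this, a viewer could resolve the single (tti) glyph back to the three characters t, t, i when searching or copying text.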

      Here is the complete FO file:

      <?xml version="1.0" encoding="UTF-8"?>
      <fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
          <fo:layout-master-set>
              <fo:simple-page-master master-name="a">
                  <fo:region-body/>
              </fo:simple-page-master>
          </fo:layout-master-set>
          <fo:page-sequence master-reference="a">
              <fo:flow flow-name="xsl-region-body">
                  <fo:block font-family="Calibri" font-size="40pt">file settings</fo:block>
              </fo:flow>
          </fo:page-sequence>
      </fo:root>
      

      Some considerations:

      1. One workaround would be to reject all substitutions that are not part of org.apache.fop.fonts.type1.AdobeStandardEncoding. This would keep the (fi) ligature but reject the (tti) one. However, this seems to work only for Calibri, not for Roboto.
      2. I think there might be an issue with the font embedding that causes some substitution mapping data to be lost. This is just a guess; I am not sure how PDF deals with substitutions.
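
      The filtering idea in consideration 1 can be sketched as follows. This is an illustration only, not FOP code: the class LigatureFilter, the method keepStandard, and the glyph name t_t_i are hypothetical. The hard-coded set stands in for a lookup against org.apache.fop.fonts.type1.AdobeStandardEncoding, whose API is not shown here. The set lists only fi and fl, the two typographic ligatures in Adobe StandardEncoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class LigatureFilter {

    // Assumption: stand-in for a lookup in AdobeStandardEncoding.
    // "fi" and "fl" are the typographic ligatures in Adobe StandardEncoding.
    static final Set<String> STANDARD_LIGATURES = Set.of("fi", "fl");

    /** Keeps only the ligature glyphs whose names appear in the standard set. */
    static List<String> keepStandard(List<String> ligatureGlyphNames) {
        List<String> kept = new ArrayList<>();
        for (String name : ligatureGlyphNames) {
            if (STANDARD_LIGATURES.contains(name)) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "t_t_i" is a hypothetical glyph name for the (tti) ligature.
        System.out.println(keepStandard(List.of("fi", "t_t_i", "fl"))); // [fi, fl]
    }
}
```

      A name-based filter like this depends on how each font names its glyphs, which may be one reason the workaround behaves differently for Calibri and Roboto.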

      I know that setting xml:lang to "en" in the FO disables the ligatures, but that is not a solution for my project. I would appreciate any suggestions.

      Attachments

        1. latn-ligatures-Antenna-House.pdf
          42 kB
          Dan Caprioara
        2. latn-ligatures-FOP.pdf
          21 kB
          Dan Caprioara
        3. test.fo
          0.5 kB
          J Frank
        4. out.pdf
          19 kB
          J Frank
        5. fop.xconf
          0.2 kB
          J Frank
        6. image-2022-05-31-15-50-26-058.png
          2 kB
          J Frank
        7. image-2022-05-31-15-50-39-029.png
          2 kB
          J Frank
        8. image-2022-05-31-15-52-01-435.png
          2 kB
          J Frank
        9. 3-fonts-latn-ligatures-FOP.fo
          1 kB
          Martin Hönings
        10. 3-fonts-copy-paste-result.png
          7 kB
          Martin Hönings
        11. 3-fonts-latn-ligatures-FOP.pdf
          1.86 MB
          Martin Hönings
        12. 3-fonts-fop.xconf
          1 kB
          Martin Hönings
        13. Screenshot 2022-06-07 092013.png
          175 kB
          Martin Hönings
        14. image-2022-06-07-15-31-01-526.png
          30 kB
          J Frank
        15. Screenshot 2022-06-08 074532.png
          34 kB
          Martin Hönings
        16. fop-1.xconf
          0.6 kB
          J Frank
        17. test-1.fo
          0.7 kB
          J Frank
        18. test-2.fo
          0.8 kB
          J Frank
        19. fop-2.xconf
          0.6 kB
          J Frank
        20. out-1.pdf
          64 kB
          J Frank

          People

            Assignee: Unassigned
            Reporter: Dan Caprioara (dc33)