Tika / TIKA-3858

Ligatures convert on text extraction

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.4.1
    • Fix Version/s: None
    • Component/s: parser
    • Environment: win 8, jre 1.5

    Description

      It appears that the issue in TIKA-1289 is still present. Ligatures get replaced by a question mark.

      As a particular example, the ft ligature is getting replaced by the UTF-8 bytes ef bf bd (U+FFFD, the Unicode replacement character).
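
      To illustrate the byte sequence above, here is a minimal Java sketch that decodes ef bf bd and counts replacement characters in a string. The class name, method names, and the sample string are hypothetical and are not part of Tika; the sample string only imitates what the extracted word might look like.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementCharCheck {
    // U+FFFD REPLACEMENT CHARACTER; its UTF-8 encoding is EF BF BD.
    static final char REPLACEMENT = '\uFFFD';

    /** Decodes raw UTF-8 bytes into a Java string. */
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Counts how many U+FFFD characters appear in the given text. */
    static long countReplacementChars(String text) {
        return text.chars().filter(c -> c == REPLACEMENT).count();
    }

    public static void main(String[] args) {
        // The three bytes reported in this issue.
        byte[] observed = {(byte) 0xEF, (byte) 0xBF, (byte) 0xBD};
        String decoded = decodeUtf8(observed);
        System.out.printf("U+%04X%n", (int) decoded.charAt(0));

        // Hypothetical extracted text: "Drafts" with the ligature lost.
        String extracted = "Dra" + REPLACEMENT + "s";
        System.out.println(countReplacementChars(extracted));
    }
}
```

      A check like this only detects that a glyph could not be mapped to text; it cannot tell which ligature was lost.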

      Is there any new resolution on this issue? Just returning the ft ligature would be great, or normalizing it to f, t.
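
      As a rough sketch of what "normalizing" could mean on the caller's side, the Java standard library can expand ligatures that have their own Unicode code points using NFKC normalization. This is an assumption about a possible post-processing step, not something Tika is confirmed to do; the class and method names are hypothetical. The example uses the fl ligature (U+FB02) because, as far as I know, Unicode has no precomposed ft ligature.

```java
import java.text.Normalizer;

public class LigatureNormalizer {
    /**
     * Applies NFKC normalization, which replaces compatibility characters
     * such as U+FB02 (fl ligature) with their plain-letter equivalents.
     */
    static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // A ligature that has a Unicode code point is expanded to "fl".
        String withLigature = "over\uFB02ow";
        System.out.println(expandLigatures(withLigature));

        // Text that already contains U+FFFD is left unchanged:
        // normalization cannot recover the lost letters.
        String alreadyLost = "Dra\uFFFDs";
        System.out.println(expandLigatures(alreadyLost).equals(alreadyLost));
    }
}
```

      Note that NFKC also rewrites other compatibility characters (for example full-width forms), so it may change more than ligatures.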

      This particular example comes from saving my Gmail inbox page as a PDF in Chrome. It uses the ft ligature in the word "Drafts".

      There are many similar examples; it is not specific to one PDF generator.

      I'm using tika-app-2.4.1.jar.

      Attachments

        1. TikaChromeInboxLigature.pdf
          120 kB
          tom hill

            People

              Assignee: Unassigned
              Reporter: tom hill (tom_eg)
              Votes: 0
              Watchers: 3

              Dates

                Created:
                Updated:
                Resolved: