TIKA-3858

Ligatures convert on text extraction


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.4.1
    • Fix Version/s: None
    • Component/s: parser
    • Environment: win 8, jre 1.5

    Description

      It appears that the issue in TIKA-1289 is still present. Ligatures get replaced by a question mark.

      As a particular example, the ft ligature is replaced by the UTF-8 bytes ef bf bd (U+FFFD, the Unicode replacement character).
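
To make the symptom easy to confirm, here is a minimal sketch in plain Java (no Tika API involved) showing that ef bf bd is the UTF-8 encoding of U+FFFD and counting how often that character occurs in a string. The class name, method names, and the sample string "Dra\uFFFDs" are hypothetical and stand in for text already extracted from the PDF.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementCharCheck {
    // U+FFFD is the Unicode replacement character.
    public static final char REPLACEMENT = '\uFFFD';

    /** Returns the UTF-8 bytes of s as lowercase hex, separated by spaces. */
    public static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Counts occurrences of U+FFFD in already-extracted text. */
    public static long countReplacementChars(String text) {
        return text.chars().filter(c -> c == REPLACEMENT).count();
    }

    public static void main(String[] args) {
        // Hypothetical extraction result for the word "Drafts".
        String extracted = "Dra\uFFFDs";
        System.out.println(utf8Hex("\uFFFD"));                // ef bf bd
        System.out.println(countReplacementChars(extracted)); // 1
    }
}
```

A non-zero count on real output would suggest that some glyph could not be mapped to a Unicode character during extraction.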

      Is there any new resolution on this issue? Just returning the ft ligature would be great, or normalizing it to f, t.
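
For ligatures that do have their own Unicode code points (for example U+FB01 fi, U+FB02 fl, U+FB03 ffi), the normalization asked for here could be applied as a post-processing step with the JDK's `java.text.Normalizer`. The sketch below is an assumption about how a caller might do that, not something Tika is known to do; the class and method names are hypothetical. As far as I know, Unicode has no dedicated code point for an ft ligature, and once the text contains U+FFFD the original letters are gone, so this step cannot repair the "Drafts" case.

```java
import java.text.Normalizer;

public class LigatureNormalizer {
    /**
     * Expands compatibility characters, including the Latin ligatures in
     * U+FB00..U+FB06, using Unicode normalization form NFKC.
     */
    public static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        System.out.println(expandLigatures("\uFB02ow"));   // flow
        System.out.println(expandLigatures("o\uFB03ce"));  // office
        // U+FFFD has no decomposition, so it passes through unchanged.
        System.out.println(expandLigatures("Dra\uFFFDs").indexOf('\uFFFD')); // 3
    }
}
```

Note that NFKC also rewrites other compatibility characters (full-width forms, superscript digits, and so on), so it may change more than ligatures.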

      This particular example comes from saving my Gmail inbox page as a PDF in Chrome. It uses the ft ligature in the word "Drafts".

      There are many similar examples; it's not specific to one PDF generator.

      I'm using tika-app-2.4.1.jar 

          People

            Assignee: Unassigned
            Reporter: tom hill (tom_eg)
            Votes: 0
            Watchers: 3
