Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 2.4.1
- None
- Environment: win 8, jre 1.5
Description
It appears that the issue in TIKA-1289 is still present. Ligatures get replaced by a question mark.
As a particular example, the ft ligature is getting replaced by the UTF-8 bytes ef bf bd, i.e. U+FFFD, the Unicode replacement character.
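A quick way to confirm what those three bytes are is to decode them with the JDK alone. The sketch below is only an illustration: the class name `ReplacementCharCheck` and the helper `countReplacements` are made up for this example and are not part of Tika or PDFBox.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not a Tika or PDFBox class.
final class ReplacementCharCheck {
    /** U+FFFD, the Unicode replacement character. */
    static final char REPLACEMENT = '\uFFFD';

    /** Counts how many replacement characters appear in the extracted text. */
    static int countReplacements(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == REPLACEMENT) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // The byte sequence reported in this issue.
        byte[] observed = {(byte) 0xEF, (byte) 0xBF, (byte) 0xBD};
        String decoded = new String(observed, StandardCharsets.UTF_8);
        System.out.printf("U+%04X%n", (int) decoded.charAt(0));

        // What "Drafts" looks like once the ligature glyph has been lost.
        String extracted = "Dra" + decoded + "s";
        System.out.println(countReplacements(extracted));
    }
}
```

Counting U+FFFD in extracted text is a cheap way to flag documents affected by this kind of glyph-mapping loss.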
Is there any new resolution on this issue? Just returning the ft ligature would be great, or normalizing it to f, t.
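For the normalization option, a minimal sketch using the JDK's `java.text.Normalizer` is shown here. It assumes the extractor returns a Unicode ligature code point (for example U+FB02, the fl ligature); the class name `LigatureNormalizer` and method `expandLigatures` are hypothetical. It cannot help once the glyph has already been replaced with U+FFFD, and Unicode's Latin ligature range (U+FB00 to U+FB06) has no dedicated ft code point, so this is a post-processing workaround rather than a fix for the case reported here.

```java
import java.text.Normalizer;

// Hypothetical post-processing helper; not part of Tika or PDFBox.
final class LigatureNormalizer {
    /**
     * Applies NFKC compatibility normalization, which expands Unicode
     * ligature code points such as U+FB01 (fi) and U+FB02 (fl) into
     * their component letters.
     */
    static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // "overflow" written with the fl ligature U+FB02.
        String withLigature = "over\uFB02ow";
        System.out.println(expandLigatures(withLigature));

        // Text where the ligature was already lost stays unchanged.
        String lost = "Dra\uFFFDs";
        System.out.println(expandLigatures(lost).equals(lost));
    }
}
```

Note that NFKC also rewrites other compatibility characters (full-width forms, superscript digits and so on), so it may be too aggressive if the extracted text must be preserved exactly.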
This particular example comes from saving my Gmail inbox page as a PDF in Chrome. It uses the ft ligature in the word "Drafts".
There are many similar examples; it's not specific to one PDF generator.
I'm using tika-app-2.4.1.jar
Attachments
Issue Links
- is a clone of: TIKA-1289 Ligatures convert on text extraction (Resolved)
- is duplicated by: PDFBOX-4532 PDFTextStripper replacing the decimal with white space (Open)