PDFBox / PDFBOX-4758

Text Extractor does not handle common typographic ligatures


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Bug
    • Affects Version/s: 2.0.1, 2.0.18
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels:
      None

      Description

      TextExtractor mishandles typographic ligatures. I've attached test documents from both Microsoft Word and LibreOffice.

      I've compared PDFBox's output with xPDF's on CentOS. That utility handles the ligatures properly, so this appears to be a PDFBox defect.
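      For anyone comparing the two outputs, the following is a minimal sketch of one possible post-processing step. It assumes (this report does not confirm it) that the extracted text contains Unicode presentation-form ligatures such as U+FB01 rather than the separate letters. The class and method names are hypothetical; only the JDK's java.text.Normalizer is used, and no PDFBox API is involved.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox or of the attached TestExtractText.java.
class LigatureNormalizer {

    // Applies Unicode compatibility normalization (NFKC), which decomposes
    // ligature code points such as U+FB01 (fi) into their component letters.
    // Note: NFKC also rewrites other compatibility characters (for example
    // full-width forms and superscript digits), so it is broader than a
    // ligature-only replacement.
    static String expandLigatures(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Stand-in for text returned by an extractor; assumed input.
        String extracted = "\uFB01nal o\uFB03ce";
        System.out.println(expandLigatures(extracted)); // prints "final office"
    }
}
```

      If only the Latin ligatures should be expanded, an explicit replacement table for U+FB00 through U+FB06 would be a narrower alternative to NFKC.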

        Attachments

        1. libreoffice-ligatures-test.pdf
          16 kB
          Michael Reynolds
        2. msword-ligatures-test.pdf
          26 kB
          Michael Reynolds
        3. TestExtractText.java
          4 kB
          Michael Reynolds


            People

            • Assignee:
              Unassigned
              Reporter:
              reynoldsm88 Michael Reynolds

