PDFBox / PDFBOX-4758

Text Extractor does not handle common typographic ligatures

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Bug
    • Affects Version/s: 2.0.1, 2.0.18
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None

    Description

      TextExtractor mishandles typographic ligatures. I've attached test documents from both Microsoft Word and LibreOffice.

      I've checked PDFBox's output against xPDF on CentOS, and that utility handles the ligatures properly, so this appears to be a PDFBox defect.
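
The report does not show the extracted text itself. Assuming the output contains Unicode presentation-form ligatures (for example U+FB01 for "fi") rather than the separate letters, one possible workaround is to post-process the extracted string with Unicode compatibility normalization. The sketch below uses only the JDK; the class and method names are hypothetical, and it operates on a plain string instead of calling PDFBox.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox: expands ligature code points
// in already-extracted text using Unicode NFKC normalization.
public class LigatureNormalizer {

    // NFKC replaces compatibility characters with their decompositions,
    // so U+FB01 becomes "fi" and U+FB03 becomes "ffi".
    static String expandLigatures(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Stand-in for text returned by a text extractor (assumed input).
        String sample = "\uFB01nal o\uFB03ce";
        System.out.println(expandLigatures(sample)); // prints "final office"
    }
}
```

NFKC also rewrites other compatibility characters (full-width forms, superscript digits, and so on), so it may be too aggressive if the extracted text must otherwise be preserved exactly.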

      Attachments

        1. TestExtractText.java
          4 kB
          Michael Reynolds
        2. msword-ligatures-test.pdf
          26 kB
          Michael Reynolds
        3. libreoffice-ligatures-test.pdf
          16 kB
          Michael Reynolds

        Activity

          People

            Assignee: Unassigned
            Reporter: Michael Reynolds (reynoldsm88)
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved: