  PDFBox / PDFBOX-4758

Text Extractor does not handle common typographic ligatures

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Bug
    • Affects Version/s: 2.0.1, 2.0.18
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None

      Description

      TextExtractor mishandles typographic ligatures. I've attached test documents from both Microsoft Word and LibreOffice.

      I've compared PDFBox's output against Xpdf on CentOS. Xpdf handles the ligatures properly, so this appears to be a PDFBox defect.
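
      For readers who see the same symptom, one possible workaround is sketched below. It assumes, without confirmation from this report, that the extracted text contains Unicode presentation-form ligature code points (for example U+FB01 for "fi") rather than separate letters. The class name LigatureNormalizer and the method expandLigatures are hypothetical and not part of PDFBox; the sketch uses only java.text.Normalizer from the JDK and operates on a string that has already been extracted.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox: expands ligature code points
// in already-extracted text using Unicode compatibility normalization.
final class LigatureNormalizer {

    private LigatureNormalizer() {
    }

    static String expandLigatures(String extracted) {
        // NFKC replaces compatibility characters such as U+FB01 with their
        // plain-letter equivalents ("fi"). It also rewrites other
        // compatibility characters (e.g. superscript digits), so apply it
        // only where that side effect is acceptable.
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Stand-in for text returned by a PDF text extractor (assumption).
        String extracted = "\uFB01nd the o\uFB03ce";
        System.out.println(expandLigatures(extracted)); // prints "find the office"
    }
}
```

      If the extracted text instead contains unrelated or missing characters where the ligatures should be, normalization will not help, since the information needed to recover the letters is not in the string.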

        Attachments

        1. TestExtractText.java
          4 kB
          Michael Reynolds
        2. msword-ligatures-test.pdf
          26 kB
          Michael Reynolds
        3. libreoffice-ligatures-test.pdf
          16 kB
          Michael Reynolds

            People

            • Assignee: Unassigned
            • Reporter: Michael Reynolds (reynoldsm88)
            • Votes: 0
            • Watchers: 2