PDFBox / PDFBOX-2451

Only gibberish extracted from certain PDF files

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Cannot Reproduce

    Description

      I was told to report a bug here. Text extraction fails for certain Dutch-language PDF files. The bug was first reported as TIKA-1095 (https://issues.apache.org/jira/browse/TIKA-1095). The problem can be reproduced with the latest Tika version.

      The extracted text shows only gibberish (in other cases, question marks and incorrect characters).

      It was suggested that this could be a font problem. Could this be looked into?
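
      For triage, one rough way to flag the "question marks and incorrect characters" case in extracted text is to measure how many non-whitespace characters are question marks, U+FFFD replacement characters, control characters, or private-use code points. The sketch below is only an illustration: the class name GarbleCheck and the method suspectRatio are hypothetical, it uses only the Java standard library, and it is not part of PDFBox or Tika. It would not catch gibberish made of ordinary letters that were mapped wrongly by a font encoding.

```java
public class GarbleCheck {

    /**
     * Returns the fraction of non-whitespace code points that look like
     * extraction debris: '?', U+FFFD, control characters, or private-use
     * code points. Returns 0.0 for empty or whitespace-only input.
     * (Hypothetical helper, not a PDFBox or Tika API.)
     */
    static double suspectRatio(String text) {
        int total = 0;
        int suspect = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // spaces and line breaks are not counted
            }
            total++;
            if (cp == '?' || cp == 0xFFFD
                    || Character.isISOControl(cp)
                    || Character.getType(cp) == Character.PRIVATE_USE) {
                suspect++;
            }
        }
        return total == 0 ? 0.0 : (double) suspect / total;
    }

    public static void main(String[] args) {
        // Clean Dutch text versus a string made up of debris characters.
        System.out.println(suspectRatio("Besluitenlijst dagelijks bestuur"));
        System.out.println(suspectRatio("??\uFFFD? ???"));
    }
}
```

      A high ratio on text extracted from the attached PDF, compared with a low ratio on a document that extracts cleanly, would be consistent with the missing-glyph-mapping symptom described above, though any threshold is an assumption to be tuned.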

      Attachments

        1. tika-other-document.png
          49 kB
          Stefan Postema
        2. tika-metadata.png
          53 kB
          Stefan Postema
        3. tika-formatted-text.png
          26 kB
          Stefan Postema
        4. ALG 2010-05-19 03 bijlage 1 - besluitenlijst dagelijks bestuur d d 10 februari 2010.pdf
          126 kB
          Stefan Postema

            People

              Assignee: Unassigned
              Reporter: Stefan Postema (Postie)
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved: