PDFBox / PDFBOX-5540

export:text creates gibberish / malformed output

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.16, 2.0.27, 3.0.0 PDFBox
    • Fix Version/s: 2.0.28, 3.0.0 PDFBox
    • Component/s: Text extraction
    • Environment: Same on Windows, Linux and macOS

    Description

      I am using PDFBox as part of Tika, and some PDFs produce unreadable text output. Copying text from the same files in Adobe, macOS Preview, or browsers works as expected.

      I also tried "re-encoding" the PDF by editing and saving it with Acrobat, in case the original PDF creator was at fault, and tried running PDFBox with different encodings, but the output remained mostly unchanged.

      I attached the PDF and the text it produces. I ran PDFBox via the CLI as follows:

      root % java -jar pdfbox-app-3.0.0-alpha3.jar export:text -i test.pdf          
      Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
      WARNUNG: Invalid ToUnicode CMap in font 
      Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
      WARNUNG: Using predefined identity CMap instead
      Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
      WARNUNG: Invalid ToUnicode CMap in font 
      Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
      WARNUNG: Using predefined identity CMap instead
      Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
      WARNUNG: Invalid ToUnicode CMap in font 
      Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
      WARNUNG: Using predefined identity CMap instead
      Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
      WARNUNG: Invalid ToUnicode CMap in font 
      Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
      WARNUNG: Using predefined identity CMap instead 
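
      For anyone triaging similar files in a pipeline, the warnings in the log above can serve as a rough signal that the extracted text may be unreliable. The following is a minimal sketch, not part of PDFBox or Tika: the class and method names are hypothetical, and it assumes the log lines have already been captured as strings. It matches on the message text only, because the level prefix is locale-dependent (WARNUNG here, WARNING in an English locale).

```java
import java.util.List;

// Hypothetical helper, not a PDFBox or Tika API.
class CMapWarningScan {
    // Message fragments copied verbatim from the log in this report.
    static final String INVALID = "Invalid ToUnicode CMap in font";
    static final String FALLBACK = "Using predefined identity CMap instead";

    /** Counts log lines that contain the given message fragment. */
    static long count(List<String> logLines, String fragment) {
        return logLines.stream().filter(line -> line.contains(fragment)).count();
    }

    /**
     * Returns true if the log shows at least one invalid ToUnicode CMap
     * together with at least one fallback to the identity CMap.
     * This is only a heuristic: it says nothing about how much of the
     * text is affected.
     */
    static boolean extractionSuspect(List<String> logLines) {
        return count(logLines, INVALID) > 0 && count(logLines, FALLBACK) > 0;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap",
            "WARNUNG: Invalid ToUnicode CMap in font ",
            "Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap",
            "WARNUNG: Using predefined identity CMap instead");
        System.out.println("invalid CMap warnings: " + count(sample, INVALID));
        System.out.println("extraction suspect: " + extractionSuspect(sample));
    }
}
```

      A caller could run such a check on the captured stderr of the export:text command and route flagged documents to a different extraction path; which path that would be is outside the scope of this report.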

      Attachments

        1. test.txt
          1 kB
          Alfons
        2. test.pdf
          157 kB
          Alfons
        3. PDFBOX-5540.pdf.txt
          1 kB
          Tilman Hausherr

          People

            Assignee: Tilman Hausherr
            Reporter: Alfons
            Votes: 0
            Watchers: 5

            Dates

              Created:
              Updated:
              Resolved: