PDFBox / PDFBOX-4293

PDFBox does not align "columns" properly

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Won't Fix
    • Affects Version/s: 2.0.11
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None
    • Environment: Windows 7 64

    Description

       I have to convert PDFs to database data. I developed a parser that reads .txt files, but the original data is available only as PDFs, so the .txt files have to be created by converting the PDFs with Tika. After conversion I noticed an alignment issue in the .txt data compared to the columns in the PDF. The Tika website says to check whether the problem also occurs in PDFBox, so I checked. PDFBox has the same issue.

      These lines of PDF data:
      a b c d e
      a b c d e

      are both presented as
      a b c d e

      in the text file, so that, for example, numbers end up in the wrong "column".
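
To illustrate the kind of output that would keep the columns apart, here is a minimal sketch of a position-based layout. It is not PDFBox code and not how PDFBox works internally: `ColumnLayout`, `Fragment` and `renderLine` are hypothetical names, and the sketch assumes each piece of text comes with the x coordinate where it starts on the page, plus an assumed average character width used to turn that coordinate into a text column.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper (not part of PDFBox): lays out positioned text on a character grid. */
final class ColumnLayout {

    /** A piece of text and the x coordinate (in points) where it starts on the page. */
    static final class Fragment {
        final double x;
        final String text;

        Fragment(double x, String text) {
            this.x = x;
            this.text = text;
        }
    }

    /**
     * Renders one line of text. Each fragment is placed at text column
     * round(x / charWidth), so fragments with the same x coordinate start
     * at the same column on every line, even when earlier cells are empty.
     */
    static String renderLine(List<Fragment> fragments, double charWidth) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.x));

        StringBuilder line = new StringBuilder();
        for (Fragment f : sorted) {
            int column = (int) Math.round(f.x / charWidth);
            // If the previous fragment already reaches this column,
            // fall back to a single separating space.
            if (line.length() > 0 && column <= line.length()) {
                column = line.length() + 1;
            }
            while (line.length() < column) {
                line.append(' ');
            }
            line.append(f.text);
        }
        return line.toString();
    }
}
```

With an assumed character width of 6 points, a fragment starting at x = 60 is placed at text column 10 whether or not the cell to its left is filled, which is the behaviour the two example lines above would need.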

      Unfortunately I cannot share business documents, but I have created an example in Excel, saved it as PDF and converted it to .txt. See attachments.

      In addition, I converted the test set online with Convertio.co. Their result is as expected, with enough spaces between the words/numbers to recognise the columns.
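
For context, this is roughly how a parser can read such space-padded output. The sketch below is an assumption about the parsing side, not the actual parser: `FixedWidthRow` and `split` are hypothetical names, and the column start offsets are assumed to be known in advance and given in ascending order.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: cuts a space-padded text line into cells at known column offsets. */
final class FixedWidthRow {

    /**
     * Splits a line at the given column start offsets (ascending, 0-based).
     * Cells are trimmed; a cell that lies beyond the end of the line is
     * returned as an empty string, so every row has the same number of cells.
     */
    static List<String> split(String line, int[] starts) {
        List<String> cells = new ArrayList<>();
        for (int i = 0; i < starts.length; i++) {
            int from = Math.min(starts[i], line.length());
            int to = (i + 1 < starts.length)
                    ? Math.min(starts[i + 1], line.length())
                    : line.length();
            cells.add(line.substring(from, to).trim());
        }
        return cells;
    }
}
```

This only works when the text file preserves horizontal positions. When every gap is collapsed to a single space, an empty cell can no longer be told apart from a filled one.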

       

      Attachments

        1. PDFconversieTekst.xlsx
          9 kB
          Rens Huizenga
        2. PDFconversieTekst.pdf.txt
          0.2 kB
          Rens Huizenga
        3. PDFconversieTekst.pdf
          100 kB
          Rens Huizenga
        4. PDFconversieTekst CONVERTIO.txt
          0.3 kB
          Rens Huizenga

          People

            Assignee: Unassigned
            Reporter: Rens Huizenga
            Votes: 0
            Watchers: 2
