PDFBox / PDFBOX-4293

PDFBox does not align "columns" properly

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Won't Fix
    • Affects Version/s: 2.0.11
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels:
      None
    • Environment:
      Windows 7 64

      Description

      I have to convert PDFs to database data. I developed a parser that reads .txt files, but the original data is available only as PDFs, so the .txt files have to be created by converting the PDFs with Tika. After conversion I noticed that the columns in the .txt data are not aligned the way they are in the PDF. The Tika website says to check whether a problem also occurs in PDFBox, so I checked: PDFBox has the same issue.

      These lines of PDF data:
      a b c d e
      a b c d e

      are both presented as
      a b c d e

      in the text file, so that numbers, for example, end up in the wrong "column".
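To illustrate why the collapsed spacing matters to a downstream parser, here is a minimal sketch in plain Java. It does not use the PDFBox or Tika API; the class name, the sample values and the column offsets (0, 5, 10) are assumptions made up for the illustration. A row whose middle cell is empty can be cut correctly when the spacing is preserved, but once every gap is a single space the last value slides into the second column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ColumnLossDemo {

    // Cut a layout-preserving line into cells at fixed character offsets.
    // The offsets are hypothetical column starts chosen for this example.
    static List<String> sliceByOffsets(String line, int[] starts) {
        List<String> cells = new ArrayList<>();
        for (int i = 0; i < starts.length; i++) {
            int from = Math.min(starts[i], line.length());
            int to = (i + 1 < starts.length)
                    ? Math.min(starts[i + 1], line.length())
                    : line.length();
            cells.add(line.substring(from, to).trim());
        }
        return cells;
    }

    // With collapsed spacing the only option left is to split on whitespace,
    // which cannot tell an empty cell from an ordinary gap.
    static List<String> splitOnWhitespace(String line) {
        return Arrays.asList(line.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        int[] starts = {0, 5, 10};
        // Three cells of width 5; the middle cell is empty.
        String aligned = String.format("%-5s%-5s%-5s", "12", "", "34");
        String collapsed = "12 34";

        System.out.println(sliceByOffsets(aligned, starts));  // three cells, middle one empty
        System.out.println(splitOnWhitespace(collapsed));     // two cells, "34" is now second
    }
}
```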

      Unfortunately I cannot share business documents, but I have created an example in Excel, saved it as PDF and converted it to .txt. See attachments.

      In addition, I converted the test set online with Convertio.co. Their result is as expected, with enough spaces between the words/numbers to recognise the columns.
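The kind of output described above, with enough spaces to keep the columns apart, can be approximated by placing each piece of text at a character column derived from its horizontal position. The sketch below is plain Java and only illustrates the idea: it is not how PDFBox, Tika or Convertio work internally, and the Token class, the renderLine method and the charWidth parameter are hypothetical names introduced here. It assumes the x coordinate of every text fragment on a line is already known.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class LayoutLineSketch {

    // Hypothetical stand-in for a text fragment with a known x position.
    static final class Token {
        final String text;
        final double x;

        Token(String text, double x) {
            this.text = text;
            this.x = x;
        }
    }

    // Render one line, mapping each x position to a character column.
    // charWidth is an assumed average glyph width in the same unit as x.
    static String renderLine(List<Token> tokens, double charWidth) {
        List<Token> sorted = new ArrayList<>(tokens);
        sorted.sort(Comparator.comparingDouble(t -> t.x));

        StringBuilder sb = new StringBuilder();
        for (Token t : sorted) {
            int col = (int) Math.round(t.x / charWidth);
            // Never overwrite earlier text: keep at least one space between neighbours.
            if (sb.length() > 0 && col <= sb.length()) {
                col = sb.length() + 1;
            }
            while (sb.length() < col) {
                sb.append(' ');
            }
            sb.append(t.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Token> row1 = new ArrayList<>();
        row1.add(new Token("100", 0.0));
        row1.add(new Token("7", 60.0));

        List<Token> row2 = new ArrayList<>();
        row2.add(new Token("a", 0.0));
        row2.add(new Token("b", 60.0));

        // Both second-column values start at the same character offset.
        System.out.println(renderLine(row1, 6.0));
        System.out.println(renderLine(row2, 6.0));
    }
}
```

Because the column is computed from the position rather than from the preceding text, values in the same PDF column start at the same offset on every line, which is what a fixed-width parser needs.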

       

        Attachments

        1. PDFconversieTekst CONVERTIO.txt
          0.3 kB
          Rens Huizenga
        2. PDFconversieTekst.pdf
          100 kB
          Rens Huizenga
        3. PDFconversieTekst.pdf.txt
          0.2 kB
          Rens Huizenga
        4. PDFconversieTekst.xlsx
          9 kB
          Rens Huizenga

            People

            • Assignee:
              Unassigned
            • Reporter:
              Rens Huizenga
            • Votes:
              0
            • Watchers:
              2