Details
Bug
Status: Closed
Blocker
Resolution: Won't Fix
2.0.11
None
None
Windows 7 64
Description
I have to convert PDFs to database data. I developed a parser that reads .txt files, but the original data is available only as PDFs, so the .txt files have to be created by converting the PDFs with Tika. After conversion I noticed an alignment issue in the .txt data compared to the columns in the PDF. The Tika website says to check whether a problem also occurs in PDFBox, so I checked, and PDFBox has the same issue.
These lines of PDF data (the values sit in different columns in the PDF):
a b c d e
a b c d e
are both presented as
a b c d e
in the text file, causing for example numbers to be presented in the wrong "column".
Unfortunately I cannot share business documents, but I have created an example in Excel, saved it as PDF, and converted it to .txt. See attachments.
In addition, I converted the test set online with Convertio.co. Their result is as expected, with enough spaces between the words/numbers to recognise the columns.
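To make the expected output concrete, here is a minimal sketch in plain Java. It is not PDFBox or Tika code: the `Fragment` class, the x positions, and the assumed average glyph width of 6 points are all hypothetical and only illustrate the difference between joining words with a single space and padding each word out to a column derived from its horizontal position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ColumnLayout {

    /** Hypothetical text fragment: left edge in points plus its text. */
    static final class Fragment {
        final double x;
        final String text;

        Fragment(double x, String text) {
            this.x = x;
            this.text = text;
        }
    }

    private static List<Fragment> sortedByX(List<Fragment> line) {
        List<Fragment> sorted = new ArrayList<>(line);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.x));
        return sorted;
    }

    /** Joins fragments with exactly one space; column positions are lost. */
    static String collapse(List<Fragment> line) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : sortedByX(line)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    /** Pads each fragment to character column round(x / charWidth). */
    static String layout(List<Fragment> line, double charWidth) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : sortedByX(line)) {
            int col = (int) Math.round(f.x / charWidth);
            // Always keep at least one space between neighbouring fragments.
            if (sb.length() > 0 && col <= sb.length()) {
                col = sb.length() + 1;
            }
            while (sb.length() < col) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        double charWidth = 6.0; // assumed average glyph width in points

        // Two rows whose values sit in different columns (made-up positions).
        List<Fragment> row1 = Arrays.asList(
                new Fragment(0, "a"), new Fragment(12, "b"), new Fragment(24, "c"),
                new Fragment(60, "d"), new Fragment(72, "e"));
        List<Fragment> row2 = Arrays.asList(
                new Fragment(0, "a"), new Fragment(12, "b"), new Fragment(48, "c"),
                new Fragment(60, "d"), new Fragment(72, "e"));

        // Both rows collapse to the same string, as in the reported output.
        System.out.println(collapse(row1));
        System.out.println(collapse(row2));

        // Position-aware padding keeps the rows distinguishable.
        System.out.println(layout(row1, charWidth));
        System.out.println(layout(row2, charWidth));
    }
}
```

With `collapse`, both rows come out identical, which is the behaviour I see. With `layout`, the gap stays where the empty column is, which is roughly what I would expect from the conversion.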