[PDFBOX-3719] pdfbox parses spaces as tabs - ASF JIRA

I converted this PDF from the attached Word document "DummyDoc.docx".

Then I used PDFBox 1.8 to extract the text:
java -jar pdfbox-app-1.8.13.jar ExtractText "DummyDoc.pdf" txt.txt

The generated output is:

Dummy document for tag extraction

Section 1

DummyTagOne_01
This is text body one

DummyTagOne_02
This is text body two

Section 2
DummyTagTwo_01
This is text body three

DummyTagTwo_02
This is text body four

DummyTagTwo_03
This is text body five

As you can see, the extracted text contains tabs where the original document has spaces, for example in "This is text body one", and likewise in the other body lines.
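
Since tabs and spaces look the same once pasted into a report, a small check on the output file can make the difference visible. The following is only a sketch, not part of PDFBox: the class name WhitespaceCheck and its two helper methods are made up for illustration, and it assumes the output file txt.txt from the command above can be read as UTF-8 (tabs are plain ASCII, so the exact encoding should not matter much for this check).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper (not part of PDFBox): makes tab characters in the
// extracted text visible so they can be told apart from ordinary spaces.
public class WhitespaceCheck {

    // Replaces each tab character with the two visible characters "\t".
    public static String showTabs(String line) {
        return line.replace("\t", "\\t");
    }

    // Counts how many tab characters occur in the given text.
    public static int countTabs(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\t') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the file produced by ExtractText, "txt.txt" by default here.
        String path = args.length > 0 ? args[0] : "txt.txt";
        String text = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);

        System.out.println("tabs found: " + countTabs(text));
        for (String line : text.split("\\r?\\n")) {
            System.out.println(showTabs(line));
        }
    }
}
```

Running it as "java WhitespaceCheck txt.txt" prints the tab count followed by each line with tabs shown as \t, which makes it easier to state exactly where the unexpected tabs appear.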