PDFBox / PDFBOX-3719

PDFBox parses spaces as tabs

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.8.13
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None

    Description

      I converted the attached Word document "DummyDoc.docx" to PDF ("DummyDoc.pdf").

      I then extracted the text with PDFBox 1.8:
      java -jar pdfbox-app-1.8.13.jar ExtractText "DummyDoc.pdf" txt.txt

      The generated output is:

      Dummy document for tag extraction

      Section 1


      DummyTagOne_01
      This is text body one


      DummyTagOne_02
      This is text body two

      Section 2
      DummyTagTwo_01
      This is text body three


      DummyTagTwo_02
      This is text body four


      DummyTagTwo_03
      This is text body five

      As you can see, the extracted text contains tab characters where the original document has spaces, for example in "This is text body one", and so on.
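
      Tabs and spaces look the same in most viewers, so it may help to make the whitespace in the extracted file visible before comparing it with the original. The following is a minimal sketch using only the Java standard library, not part of PDFBox. The class name WhitespaceInspector and its two helper methods are hypothetical, and it assumes the output file (txt.txt above) can be read as UTF-8, which may not match the encoding ExtractText actually used on your platform.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for inspecting ExtractText output; not a PDFBox class.
class WhitespaceInspector {

    /** Replaces every tab character with the visible two-character marker "\t". */
    static String showTabs(String text) {
        return text.replace("\t", "\\t");
    }

    /** Collapses each run of tabs and spaces into one space; line breaks are kept. */
    static String normalizeTabs(String text) {
        return text.replaceAll("[ \\t]+", " ");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java WhitespaceInspector <extracted-text-file>");
            return;
        }
        // Assumption: the extracted file is UTF-8 encoded.
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        String text = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(showTabs(text));
    }
}
```

      Running it on txt.txt prints the extracted text with each tab shown as \t. If the tabs are unwanted downstream, normalizeTabs is one possible post-processing step; it changes only the extracted string, not how PDFBox reads the PDF.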

      Attachments

        1. DummyDoc.docx
          17 kB
          Ahmed Eltayeb
        2. DummyDoc.pdf
          16 kB
          Ahmed Eltayeb

          People

            Assignee: Unassigned
            Reporter: Ahmed Eltayeb (aeltayeb)